Preface



This volume contains the papers accepted for presentation at the 3rd Workshop 
on Algorithm Engineering (WAE’99) held in London, UK, on July 19-21, 1999, 
together with the extended or short abstracts of the invited lectures by Andrew 
Goldberg, Bill McColl, and Kurt Mehlhorn. WAE is an annual meeting devoted 
to researchers and developers interested in the practical aspects of algorithms 
and their implementation issues. Previous meetings were held in Venice (1997) 
and Saarbrücken (1998).

Papers were solicited describing original research in all aspects of algorithm 
engineering including: 

— Implementation, experimental testing, and fine-tuning of discrete algorithms

— Development of software repositories and platforms which allow use of and 
experimentation with efficient discrete algorithms

— Methodological issues such as standards in the context of empirical research 
on algorithms and data structures 

— Methodological issues involved in the process of converting user requirements 
into efficient algorithmic solutions and implementations

The program committee selected 24 papers from a total of 46 submissions. 
The program committee meeting was conducted electronically from May 6 to 
May 16, 1999. The criteria for selection were perceived originality, quality, and 
relevance to the subject area of the workshop. Considerable effort was devoted 
to the evaluation of the submissions and to providing the authors with helpful 
feedback. Each submission was reviewed by at least three program committee 
members (occasionally assisted by subreferees). However, submissions were not 
refereed in the thorough way that is customary for journal papers, and some 
of them represent reports of continuing research. It is expected that most of 
the papers in this volume will appear in finished form in scientific journals. A 
special issue of the ACM Journal on Experimental Algorithmics will be devoted 
to selected papers from WAE’99. 

We would like to thank all those who submitted papers for consideration, as 
well as the program committee members and their referees for their invaluable 
contribution. We gratefully acknowledge the dedicated work of the organizing 
committee (special thanks to Tomasz Radzik and Rajeev Raman, who did most 
of the work), the support of the Department of Computer Science at King’s 
College, and the generous help of various volunteers: Gerth Brodal, Sandra 
Elborough, Ulrich Endriss, Viren Lall, Jose Pinzon, and Naila Rahman. We 
thank them all for their time and effort. 
Selecting Problems for Algorithm Evaluation



Andrew V. Goldberg* 

InterTrust STAR Lab., 460 Oakmead Parkway, Sunnyvale, CA 94086, USA
goldberg@intertrust.com



Abstract. In this paper we address the issue of developing test sets for 
computational evaluation of algorithms. We discuss both test families for 
comparing several algorithms and selecting one to use in an application, 
and test families for predicting algorithm performance in practice. 



1 Introduction 



Experimental methodology is important for any experimental science. Several
recent papers address methodology issues in the area of computational
evaluation of algorithms. A recent CATS project [?] is aimed at standardizing
computational test sets and making them readily available to the research and 
user communities. In this paper we discuss how to design and select problems 
and problem families for testing general-purpose implementations, i.e., imple- 
mentations designed to be robust over a wide range of applications. 

Major goals of experimental evaluation of algorithms are to 



— determine relative algorithm performance. 

— explain observed performance and find bottlenecks. 

— facilitate performance prediction in applications. 



We address the issue of designing computational experiments to achieve these 
goals. 

A comparative study determines the best algorithm or algorithms for the 
problem. When no single algorithm dominates the others, one would like to 
gain understanding of which algorithm is good for what problem class. The best 
algorithms warrant more detailed studies to gain better understanding of their 
performance and to enable prediction of their performance in applications. 

Theoretical time and space bounds are usually worst-case and hold in ab- 
stract models. For many algorithms, their typical behavior is much better than 
worst-case. Practical considerations, such as implementation constant factors, lo- 
cality of reference, and communication complexity, are difficult to analyze exactly 
but can be measured experimentally. Experimentation helps to understand algo- 
rithm behavior and establishes real-life bottlenecks. This can lead to improved 
implementations and even improved theoretical bounds. 

* Part of this work has been done while the author was at NEC Research Institute, 4 
Independence Way, Princeton, NJ 08540. 
Significant investment of resources may be required to develop a software
system. Thus one would like to be able to determine in advance if the system 
performance will meet the requirements. Usually at the system design stage one 
has an idea of the size, and sometimes the structure, of the subproblems which 
need to be solved by the system. To obtain a performance estimate, one needs
to estimate the performance of subroutines for these subproblems. This task is 
simple if one can generate the subproblems and has good implementations of 
the subroutines available. Getting the system to the state when it can produce 
the subproblems, however, may require implementing a large part of the system 
and may be time-consuming. Obtaining subroutine implementations may also 
be hard: for example, one may be unwilling to purchase an expensive program 
until it is clear that the system under development is feasible. Computational 
studies may allow one to obtain good performance estimates before a system is
built.

As algorithms and hardware improve and new applications arise, computa- 
tional tests need to evolve. Although one can predict some of the future trends 
(e.g. increased memory and problem size) and prepare for these, one cannot 
predict other developments (e.g. radically new algorithms). 

Designing a good computational study is an art that requires detailed knowl- 
edge of the problem in question as well as experimental feedback. Although we 
cannot give a formula for such a design, we give general principles and helpful 
ideas, and clarify the experiment design goals. 

This paper is based on our experience with computational study of graph 
algorithms. However, most of what we say should apply to other algorithms as 
well. 



In this paper we restrict the discussion to the test set design. Many important 
issues, such as test calibration and machine-independent performance measures, 
fall outside the scope of this paper. We also do not discuss architecture dependence,
although algorithm performance and the relative performance may depend on 
the architectural features such as cache size, floating point performance, number 
of processors, etc. However, we feel these issues are of somewhat different nature 
than the ones we address here, except for the problem size selection issues. For 
a discussion of caching issues, see e.g. [?].



This paper is organized as follows. We start with a brief discussion of correctness
testing in Section 2. Then we discuss real-life vs. synthetic problems
in Section 3. In Section 4 we discuss why and how to generate problems with
a desired solution structure. Next, in Section 5 we focus on problem structure
and its relationship to randomization. In Section 6 we discuss the importance
of using both hard and easy problems in computational tests. In Section 7 we
introduce the concept of an algorithm separator that is important for comparative
evaluation of algorithms. We discuss parameter value selection in Section 8.
In Section 9 we discuss standard test families. In Section 10 we point out how
to use an experimental study to select the algorithm most appropriate for
an application. Section 11 gives a summary of the proposed experiment design
process.
2 Testing Correctness

The first step in testing a subroutine is to establish its correctness. A good col- 
lection of test problems is helpful here. Such a collection should include problems 
of different types. Problems that demonstrated buggy behavior during the algorithm
development process may be especially useful in subsequent projects.

A related set of tests is for implementation limits. The limits can be on the
problem size, parameter values, or a combination of problem size and parameter
values. For example, in the context of the shortest path problem, if one uses
16-bit integers for the input arc lengths and 32-bit integers for the distances,
overflows are possible if the input graph has a path with more than $2^{16}$ arcs.
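
The overflow limit in this example can be checked with a few lines of arithmetic. The sketch below assumes unsigned integer types (the text does not say whether the types are signed), and the function names are hypothetical helpers introduced only for illustration.

```python
def max_safe_arcs(length_bits: int, distance_bits: int) -> int:
    """Largest number of arcs on a path that can never overflow the distance type.

    Assumption: unsigned integers, so arc lengths lie in [0, 2**length_bits - 1]
    and distances lie in [0, 2**distance_bits - 1].
    """
    max_length = 2 ** length_bits - 1
    max_distance = 2 ** distance_bits - 1
    # A path of k arcs has length at most k * max_length.
    return max_distance // max_length


def path_can_overflow(num_arcs: int, length_bits: int = 16, distance_bits: int = 32) -> bool:
    """True if some assignment of arc lengths makes a path of num_arcs arcs overflow."""
    return num_arcs > max_safe_arcs(length_bits, distance_bits)
```

Under these assumptions the safe limit is on the order of $2^{16}$ arcs, in line with the bound quoted above; a test set for implementation limits would include paths just below and just above it.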

3 Real-Life and Synthetic Problems 

A good library of real-life problem instances can be very useful. Such a library 
should cover as many different applications as possible. For each application, it 
is desirable to have a wide range of problem sizes for asymptotic performance 
estimation. It is also desirable to have enough problems to be able to make 
meaningful statistical conclusions. 

Real-life problem structure can differ significantly from the structure of natu- 
ral synthetic problems. This can lead to different algorithm performance. Often, 
real-life problems are very simple. For example, in a study of the minimum test
set problem, Moret and Shapiro [?] observe that real-life instances from several
applications are easy while random instances are hard. 

Some real-life instances are hard for some algorithms. For example, for the
multicommodity flow implementation of [?] that uses Relax [?] as a minimum-cost
flow subroutine, synthetic problems were easy, but a small real-life problem
turned out to be hard. This was due to the very poor performance of Relax on
some subproblems of the real-life problem.

In some application areas, real-life problems are difficult to find. Many companies
consider their problem instances proprietary and do not make them publicly
available. Only a limited number of large problem instances can be maintained
due to storage limitations. When only a small number of real-life problems
is available, perturbations of these problems can be used as “almost” real-life
problems.

Even when real-life problems are available, there are good reasons for using 
synthetic problems as well. Problems can be generated with desired structure 
and parameter values. Synthetic problems also anticipate future applications and 
explore algorithm strengths and weaknesses. In the rest of the paper we discuss 
how to generate interesting problem instances. 

4 Solution Structure 

Problem solution structure is very important. This observation is implicit in
many computational experiments, but was first stated explicitly by Gusfield [?]. For
some algorithms, solution structure is closely related to algorithm performance.
For example, the Bellman-Ford-Moore algorithm for the shortest path
problem usually takes much more time on networks with deep shortest path trees
than on the same size networks with shallow trees.
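
The effect of tree depth can be seen on two tiny instances. The sketch below is illustrative only: it assumes a pass-based variant of Bellman-Ford-Moore that scans a fixed arc list until no label changes (not necessarily the variant studied in the cited work), and the two instance builders are hypothetical.

```python
def bellman_ford_passes(n, arcs, source=0):
    """Pass-based Bellman-Ford-Moore.

    Returns (dist, useful_passes), where useful_passes counts the passes over
    the arc list in which at least one distance label improved.
    """
    inf = float("inf")
    dist = [inf] * n
    dist[source] = 0
    useful_passes = 0
    for _ in range(n):  # at most n - 1 useful passes without negative cycles
        changed = False
        for u, v, length in arcs:
            if dist[u] + length < dist[v]:
                dist[v] = dist[u] + length
                changed = True
        if not changed:
            break
        useful_passes += 1
    return dist, useful_passes


def deep_path_arcs(n):
    """Path 0 -> 1 -> ... -> n-1 (deep tree), arcs listed farthest-first."""
    return [(i, i + 1, 1) for i in reversed(range(n - 1))]


def star_arcs(n):
    """Star centered at vertex 0 (shallow tree) with the same number of arcs."""
    return [(0, i, 1) for i in range(1, n)]
```

Both instances have n - 1 arcs, yet with this arc order the path needs n - 1 useful passes while the star needs one.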

One can generate a problem with the desired solution structure by starting 
with a solution and then perturbing it and adding data to “hide” the solution. 
This approach is usually easy to implement and can lead to interesting problem 
classes. Starting from a solution that ends up as an optimal solution to the 
generated problem also guarantees that the problem is feasible and facilitates 
code correctness testing. 

For example, one can generate a single-source shortest path problem as follows.
First, generate a tree rooted at the source s, assign tree arc lengths, and
compute, for every vertex v, the distance $d(v)$ from s to v in the tree. Then
generate arcs $(x, y)$ to obtain a desired graph structure. Assign the length of
$(x, y)$ to be $d(y) - d(x) + e(x, y)$, where $e(x, y) > 0$. For any positive function e,
the tree is the shortest path tree, because every added arc then has positive
reduced length with respect to d. If the values of e are very small compared to
the tree arc lengths and the number of additional arcs is large, then many trees
will give distances close to the shortest path distances, “hiding” the optimal
solution. Note that the tree arc lengths and distances may be negative, but the
network we construct does not have negative cycles. This is a very simple way
to construct a network with cycles and negative-length arcs, but no negative cycles.
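
The construction just described can be sketched as follows. This is a minimal illustration under stated assumptions, not a published generator: the function names, the integer length ranges, and the way the random tree is drawn are all hypothetical choices.

```python
import random


def hidden_tree_instance(n, extra_arcs, max_tree_len=1000, max_eps=5, seed=0):
    """Generate arcs whose shortest path tree from vertex 0 is a planted tree.

    Returns (arcs, d): arcs is a shuffled list of (x, y, length) and d[v] is the
    planted distance from vertex 0 to v. Integer lengths keep arithmetic exact.
    """
    rng = random.Random(seed)
    d = [0] * n
    arcs = []
    for v in range(1, n):
        parent = rng.randrange(v)  # parent has a smaller index, so 0 is the root
        length = rng.randint(-max_tree_len, max_tree_len)  # may be negative
        d[v] = d[parent] + length
        arcs.append((parent, v, length))
    for _ in range(extra_arcs):
        x, y = rng.randrange(n), rng.randrange(n)
        if x == y:
            continue
        eps = rng.randint(1, max_eps)  # positive slack e(x, y)
        arcs.append((x, y, d[y] - d[x] + eps))
    rng.shuffle(arcs)  # hide the tree arcs among the others
    return arcs, d


def bellman_ford_distances(n, arcs, source=0):
    """Plain Bellman-Ford, used here only to check the planted distances."""
    dist = [float("inf")] * n
    dist[source] = 0
    for _ in range(n):
        changed = False
        for x, y, length in arcs:
            if dist[x] + length < dist[y]:
                dist[y] = dist[x] + length
                changed = True
        if not changed:
            break
    return dist
```

Because the planted distances are known in advance, the same instance doubles as a correctness test: any shortest path code run on `arcs` should reproduce `d`.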

Sometimes a natural problem structure implies a solution structure. For ex- 
ample, a shortest path tree in a random graph with independent identically 
distributed arc lengths is small. One needs to check the solution structure to 
determine solution properties. For example, if solutions to a problem family are 
simple, one should be aware of it. 



5 Problem Structure and Randomness 



Problems with a simple and natural structure, such as random graphs and grids,
are easy to generate. See [?] for more natural graph classes. Such problems also
can resemble real-life applications. For example, minimum s-t cut problems on
grids have been used in statistical physics (e.g. [?]) and stereo vision [?].
Shortest path problems on grids arise in an inventory application [?], one about
which we learned only after using such problems in a computational study [?].

If the problem structure is simple, algorithm behavior is easier to understand
and analyze. Synthetic problems with simple structure often are computationally
simple as well. Not all problems with a simple structure are computationally
simple, however. For example, the bicycle wheel graphs are difficult for all current
minimum cut algorithms. When natural problems are easy, this does not necessarily
make these problems uninteresting: many real-life problem instances are
computationally simple, too.

Sometimes one needs to use more complicated problem structures. In some 
cases, a complicated structure may better model a real-world application. Simple 
structures may also allow special-purpose heuristics that take advantage of this 
particular structure but do not work well on other problem types. An example of
a generator that produces fairly complicated problems is the GOTO generator
for the minimum-cost flow problem [?]. Problems produced by this generator
seem to be difficult for all currently known combinatorial algorithms for the 
problem. 

Some synthetic problem families, especially worst-case families, are deter- 
ministic. At the other extreme, some problems are completely random. Most 
problem families, however, combine problem structure with randomness. For 
example, the Gridgen generator [?] outputs two-dimensional grid graphs with
random arc weights. 
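
A family that combines structure with randomness can be very small in code. The sketch below is a hypothetical minimal grid generator written for illustration; it is not the Gridgen generator, and its name, arc orientation (both directions), and weight range are assumptions.

```python
import random


def grid_graph(rows, cols, max_weight=100, seed=0):
    """Two-dimensional grid with arcs in both directions and random weights.

    Vertices are numbered row by row. The structure is deterministic; only the
    arc weights, drawn uniformly from 1..max_weight, are random.
    """
    rng = random.Random(seed)

    def vertex(r, c):
        return r * cols + c

    arcs = []
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    u, v = vertex(r, c), vertex(rr, cc)
                    arcs.append((u, v, rng.randint(1, max_weight)))
                    arcs.append((v, u, rng.randint(1, max_weight)))
    return arcs
```

Fixing the seed makes each instance reproducible, which matters when results are to be compared across studies.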

Randomness can come into a problem in several ways: it may be part of the 
problem structure (e.g. random graphs), parameters (such as arc weights in a 
graph) may be selected at random from a distribution, and a problem can be 
perturbed at random. 

The latter can be done for several reasons. One reason is to hide an optimal 
solution. Another is to use one problem instance (e.g. a real-life instance) to 
get many similar instances. Finally, one can perturb certain parameters to test
an algorithm’s sensitivity to these parameters.
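
As one illustration of random perturbation, the sketch below derives similar instances from a single base instance by scaling each arc length by a small random factor. The function name and the relative-noise model are assumptions made for this example; other perturbations (adding arcs, changing capacities) would follow the same pattern.

```python
import random


def perturb_lengths(arcs, rel=0.05, seed=0):
    """Return a copy of arcs with each length changed by at most rel (relative).

    arcs is a list of (u, v, length); the graph structure is left untouched.
    """
    rng = random.Random(seed)
    return [(u, v, length * (1 + rng.uniform(-rel, rel))) for u, v, length in arcs]
```

Calling it with different seeds yields a family of “almost” identical instances; varying `rel` probes how sensitive a code is to the lengths.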

6 Hard and Easy Problems 

Hard problem families, in particular worst-case families, are interesting for sev- 
eral reasons. They show limitations of the corresponding codes and give upper 
bounds on how long the codes may take on a problem of a certain size. When 
a hard problem for one code is not hard for another, one gets an “algorithm 
separator.” See Section 7.

Studying a code on hard problems may give insight into what makes the code 
slow and may lead to performance improvements. Profiling localizes bottlenecks 
to be optimized at a low level. The ease of designing hard problem families for 
a certain code is sometimes closely related to the code’s robustness. 

Next we discuss easy problems. We say that a problem family is general if 
problems in this family cannot be solved by a special-purpose algorithm. By 
easy problem families (for a given code) we mean problem families on which 
the code’s performance compares well with its performance on other problem
instances of the same size. Easy problem families which are also general are more 
interesting than the non-general ones. 

Easy problems give a lower bound on the code’s performance. Bottlenecks 
for easy problems often are different from the bottlenecks for hard problems and 
need to be optimized separately. In some cases practical problems are easy, so 
optimizing for easy problems is important. 

Consider, for example, a heap-based implementation of Dijkstra’s algorithm
[?]. Dense graphs are easy for this implementation. Both in theory and in practice,
scans of the input graph’s arcs dominate the algorithm’s running time on dense
graphs (see e.g. [?]). On these graphs, an internal graph representation that allows
the arcs going out of a vertex to be scanned efficiently is crucial. Large sparse graphs are
hard for this implementation. In this case, heap operations are the bottleneck,
and efficiency of the heap implementation is crucial for performance.
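
The shift in bottleneck between dense and sparse graphs can be observed by counting operations. The sketch below assumes a binary heap with lazy deletion (Python's `heapq`), which is only one possible heap-based implementation; the counters are added for illustration.

```python
import heapq


def dijkstra_counts(n, adj, source=0):
    """Heap-based Dijkstra that also counts arc scans and heap operations.

    adj[u] is a list of (v, length) pairs with nonnegative lengths.
    Returns (dist, arc_scans, heap_ops), where heap_ops counts pushes and pops.
    """
    inf = float("inf")
    dist = [inf] * n
    dist[source] = 0
    done = [False] * n
    heap = [(0, source)]
    arc_scans, heap_ops = 0, 1  # the initial push counts as one heap operation
    while heap:
        d, u = heapq.heappop(heap)
        heap_ops += 1
        if done[u]:
            continue  # stale entry left behind by lazy deletion
        done[u] = True
        for v, length in adj[u]:
            arc_scans += 1
            if d + length < dist[v]:
                dist[v] = d + length
                heapq.heappush(heap, (dist[v], v))
                heap_ops += 1
    return dist, arc_scans, heap_ops
```

On a complete graph the arc scans far outnumber the heap operations, while on a simple path the heap operations outnumber the arc scans; each heap operation also costs logarithmic rather than constant time.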

Designing hard problem families is not easy. Often, worst-case problem fam- 
ilies for an algorithm are developed to demonstrate analysis tightness. The goal 
is to design a problem family that causes as many bottleneck operations as the 
upper bound of the algorithm analysis allows. See, for example, [?] for a worst-case
example of an algorithm for the maximum flow problem. Another example
is a graph family of [?], where the authors use a graph family on which Prim’s
minimum spanning tree algorithm performs a large number of decrease-key operations.

However, some worst-case examples require a sequence of bad choices on 
the algorithm’s part, which is unlikely in practice. Furthermore, implementations may
contain heuristics which improve performance on known worst-case examples. In
particular, the above-mentioned maximum flow problems of [?] are not that hard
for the better maximum flow codes studied in [?]. This motivated the authors
of the latter paper to develop a family of problems which are hard for all the
codes in their study.

The design of easy problem families is similar to that of hard families. How- 
ever, it is usually easier to design simple problem families. In particular, special 
cases of a problem often lead to simple problem families. 

Hard and easy problems give insight into the problem structure and param- 
eters which are unfavorable or favorable for a code, and help in predicting the 
code’s performance in practice. 



7 Algorithm Separators 



We say that a problem instance separates two implementations if there is a 
big difference in the implementation performance on this instance. Note that 
performance of two codes is indistinguishable unless one finds problems that
separate them. Such algorithm separator problems (or problem families) are
important for establishing relative algorithm performance. Ideally, one would 
like to find problem families on which algorithm performance difference grows 
with the problem size. 
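
One way to make the notion of a growing separator operational is to look at the performance ratio of two codes across a family of increasing sizes. The sketch below is a hypothetical check, not a definition from the text; the threshold of 2 and the requirement that the ratio increase strictly are assumptions.

```python
def separation_ratios(times_a, times_b):
    """Ratio of code B's cost to code A's cost at each problem size."""
    return [b / a for a, b in zip(times_a, times_b)]


def is_growing_separator(times_a, times_b, threshold=2.0):
    """True if B/A increases with size and ends at or above the threshold.

    times_a and times_b are measurements (running times or operation counts)
    for the same instances, ordered by increasing problem size.
    """
    ratios = separation_ratios(times_a, times_b)
    growing = all(later > earlier for earlier, later in zip(ratios, ratios[1:]))
    return growing and ratios[-1] >= threshold
```

For instance, a linear operation count against a quadratic one yields a ratio that grows with size, whereas two codes that differ by a constant factor of 1.1 are not separated by this test.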

Solution structure, problem structure, and certain choices of parameter values 
may lead to separator instances. Easy and hard problems are often separators 
as well. 

Natural algorithm separators are of special interest. Knowing that a natural 
problem structure is hard for one algorithm, but easy for another one, greatly 
contributes to understanding of these algorithms. For example, an experimental 
study of [?] shows that certain algorithms, in particular incremental graph algorithms,
perform poorly on acyclic graphs with negative arc lengths.

Note that if the performance of two algorithms is robust, finding a problem 
family that separates these algorithms is hard. If one fails to find a separator for 
two algorithms, this is also interesting, especially if the algorithms are based on 
different approaches. 
8 Parameter Values

One would like to understand how algorithm performance depends on the input 
parameter values. Since parameters may be interrelated, one is tempted to test 
all combinations of parameters. However, this may be infeasible unless the number
of parameters is very small. One should try to find out which parameters are 
independent and study performance dependency on parameter subsets. 

Note that one does not need to report in detail on all experiments performed. 
For example, if the results for certain parameter combinations were similar to 
another combination, one can give one set of data and state that the other data 
was similar. The fact that an algorithm’s performance is robust with respect to
certain parameter values is interesting in itself. 

Size is one of the parameters of a problem. One would like to know how 
algorithm performance scales as the size goes up, with the problem structure 
and other parameter values staying the same. In such tests, one has to go to as 
big a size as feasible, because for bigger sizes, low-order terms are less significant 
and asymptotic bottlenecks of an algorithm are more pronounced. 
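
The reason for going to large sizes can be seen by estimating the empirical growth exponent between consecutive sizes. The sketch below is illustrative; the cost function used in the example, $n^2 + 1000n$, is invented to show how a low-order term distorts the estimate at small sizes.

```python
import math


def growth_exponents(sizes, costs):
    """Empirical exponent between consecutive measurements.

    If cost grows like size**alpha, then log(cost ratio) / log(size ratio)
    estimates alpha for each consecutive pair of measurements.
    """
    pairs = list(zip(sizes, costs))
    return [
        math.log(c2 / c1) / math.log(n2 / n1)
        for (n1, c1), (n2, c2) in zip(pairs, pairs[1:])
    ]


# Invented example: a quadratic cost with a large linear low-order term.
example_sizes = [1000 * 2 ** i for i in range(6)]
example_costs = [n * n + 1000 * n for n in example_sizes]
example_exponents = growth_exponents(example_sizes, example_costs)
```

In this example the estimates start well below 2 and approach 2 as the size doubles, so a study that stops at small sizes would underestimate the asymptotic growth.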

Memory hierarchy is an issue for problem size selection. Small problems may 
fit in cache, making reference locality unimportant. Large problems may not fit 
in the main memory and cause paging. The context of a study determines if one 
should consider such small or large problems. 



9 Commonly Used Problems 

In some areas, computational research produced widely accepted data sets for
algorithm evaluation. Such data sets are important because they make it possible 
to compare published results and to evaluate new algorithms. 

A bad data set, however, may lead to wrong conclusions and wrong directions 
for algorithm development. If the problem classes represented in the data set are 
not broad enough, codes tested exclusively on this data set may become tuned 
to this data set. This may lead to inaccurate prediction of practical performance 
and the wrong choice of the algorithm for an application. 

The 40 Netgen problems [?] provide an example of such an event. These
problems were generated in the early 70’s. For over a decade, most computational
studies of minimum cost flow algorithms used either these problems
exclusively, or generated similar problems using Netgen. In the early 90’s, Relax [?],
a minimum-cost flow subroutine, was used in a multicommodity flow
implementation due to its availability and its excellent performance on
Netgen-generated problems. (By this time the results of [?] suggested that Relax
is not robust.) The authors of [?] discovered that on the only real-life problem
in their tests (with only 49 vertices), Relax performed hundreds of times
slower than several other codes. This is a good example of the danger of using a
non-robust code in a new application.

Designing good common data sets is a very important part of computational 
study of algorithms. Such data sets, however, need to evolve. With bigger and 
faster computers and faster algorithms, bigger test problems may be required.
New algorithms may require new easy, hard, and separator problem families. As
some algorithms become obsolete, some problem families may become obsolete 
as well. 

With the data set evolution, one has to maintain history and justification for 
the changes, so that old results can be compared to the new ones. 



10 Choosing an Algorithm for an Application 

How does one use a computational study to find the best algorithm for an application,
or to estimate an algorithm’s performance? This is easy to do given both
codes and typical problems. As we discussed earlier, however, one may want to 
get an estimate without the problems. We describe how to make this decision 
using a computational study. 

First one has to read the computational study and to determine which prob- 
lems or problem family most resembles the application. If the codes and gener- 
ators for the computational study are available and no problem family in the 
study closely resembles the application, one may consider selecting parameter 
values and generating problems to model the applications as closely as possible. 

Then one chooses an algorithm that works well for this problem family. This 
may involve some compromises. In particular, a more robust algorithm may be 
preferable over a slightly faster one, especially if the problem family models the 
application only approximately. Programming ease and code availability may 
also affect the choice. 

To predict the running time on an application problem, one takes the run- 
ning time of the closest available test instance. Note that the time may need 
adjustments to account for the difference in hardware and in problem size. 
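
The adjustment mentioned above can be written as a simple scaling rule. The sketch below is a rough model under strong assumptions that the text does not state: running time grows as a power of the problem size with a known exponent, and the hardware difference is a single speed factor. Both parameters and the function name are hypothetical.

```python
def predict_running_time(test_time, test_size, app_size, exponent, speed_ratio=1.0):
    """Scale a measured running time to a different size and machine.

    test_time   -- time measured on the closest available test instance
    test_size   -- size of that instance
    app_size    -- size of the application problem
    exponent    -- assumed growth exponent (time ~ size**exponent)
    speed_ratio -- how many times faster the target machine is (assumed)
    """
    size_factor = (app_size / test_size) ** exponent
    return test_time * size_factor / speed_ratio
```

For example, a 10-second run on a size-1000 instance, linear scaling, and a machine twice as fast predict 20 seconds for a size-4000 problem. The exponent would typically come from a scaling experiment on the same problem family.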

As a case study, we consider two papers. The first paper, by Cherkassky et
al. [?], compares different shortest path algorithms. The second one, by Zhan and
Noon [?], studies performance of the same codes on real road networks. The
latter study contains two problem families, one low detail and the other
high detail. These families differ in the road levels included in the graph.

Road networks are planar, low degree graphs, and among the problems studied
in [?], grid graphs model them best. Although this data is not available
in [?], it appears that the low detail problems have relatively shallow shortest
path trees and the high detail problems have deeper trees. We compare the former
with the wide grid family of [?] and the latter with the square grid family.
Clearly grids do not model road networks in full detail, and because of this the
algorithm rankings differ.

Tables 1 and 2 give rankings of algorithms on the corresponding problem
families. In the first table, all rankings are within one of each other except
for GOR1 and THRESH. Even though the rankings of the two codes differ
substantially, their performance is within a factor of two. The rankings are
closer in the second table: they match except that the two fastest codes are switched
and the two slowest codes are switched. Grid graphs are a fair model for
road networks and lead to moderately good relative performance predictions.

                BFP  GOR  GOR1  DIKE  DIKED  PAPE  TWO_Q  THRESH
Wide grids        4    5     8     7      6     1      2       3
Road networks     3    4     5     8      6     1      2       7

Table 1. Low detail road networks vs. wide grid problem family: algorithm
ranks.

                BFP  GOR  GOR1  DIKE  DIKED  PAPE  TWO_Q  THRESH
Square grids      7    4     8     6      5     1      2       3
Road networks     8    4     7     6      5     2      1       3

Table 2. High detail road networks vs. square grid problem family: algorithm
ranks.
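
The agreement between the rankings in Tables 1 and 2 can be quantified. The sketch below uses the ranks from the two tables; the column labels are copied as they appear in the extracted tables, and Spearman's rank correlation is an added summary statistic that the text itself does not use.

```python
CODES = ["BFP", "GOR", "GOR1", "DIKE", "DIKED", "PAPE", "TWO_Q", "THRESH"]

# Ranks from Table 1 (low detail) and Table 2 (high detail).
WIDE_GRIDS = [4, 5, 8, 7, 6, 1, 2, 3]
ROADS_LOW = [3, 4, 5, 8, 6, 1, 2, 7]
SQUARE_GRIDS = [7, 4, 8, 6, 5, 1, 2, 3]
ROADS_HIGH = [8, 4, 7, 6, 5, 2, 1, 3]


def rank_gaps(ranks_a, ranks_b):
    """Absolute rank difference for each code."""
    return {code: abs(a - b) for code, a, b in zip(CODES, ranks_a, ranks_b)}


def spearman(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def large_gaps(ranks_a, ranks_b, limit=1):
    """Codes whose ranks differ by more than limit."""
    return [code for code, gap in rank_gaps(ranks_a, ranks_b).items() if gap > limit]
```

With these inputs the only codes whose ranks differ by more than one in the first table are GOR1 and THRESH, and the correlation is higher for the second table than for the first, matching the discussion above.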

Based on the results of [?], one may choose PAPE or TWO_Q for a road
network application. This agrees with the results of [?].



11 Summary 

In conclusion, we summarize the process of designing a computational study. 
One starts with several natural, hard, and easy problem families and experiments 
with various parameter values to find interesting problem families and to gain a 
better understanding of algorithm performance. Additional problem families may 
be developed to clarify certain aspects of algorithm performance and to separate 
algorithms not yet separated. As a final step, one selects the problem families to 
report on in a paper documenting the research. The writeup should
state unresolved experimental issues. For example, if the authors were unable to 
separate two algorithms, they should state this fact. 

As we have mentioned, algorithm test sets need to evolve as computers, algorithms,
and applications develop. Feedback from applications, both in the form
of comments and real-life instances, is important for evolution and for improved 
cooperation between algorithm theory, computational testing, and applications. 
Scalable computing is rapidly becoming the normal form of computing. In
a few years time it may be difficult to buy a computer system which has only 
one processor. Scalable systems will come in all shapes and sizes, from cheap 
PC servers and Linux clusters with a small number of processors, up to large, 
expensive supercomputers with hundreds or thousands of symmetric multipro- 
cessor (SMP) nodes. An important challenge for the research community is to 
develop a unified framework for the design, analysis and implementation of scal- 
able parallel algorithms. 

In this talk I will describe some of the work which has been carried out 
on BSP computing over the last ten years. BSP offers a number of important 
advantages from the perspective of scalable algorithm engineering: applicability 
to all architectures, high performance implementations, source code portabil- 
ity, predictability of performance across all architectures, simple analytical cost 
modelling, globally compositional design style based on supersteps which facil- 
itates development, debugging and reasoning about correctness. The BSP cost 
model can be used to characterise the capabilities of a scalable architecture in 
a concise “machine signature” which accurately describes its computation, com- 
munication and synchronisation properties. It can also be used to analyse and 
explore the space of possible scalable algorithms for a given problem, taking 
account of their differing computation, communication, synchronisation, mem- 
ory and input/output requirements. Moreover, it can be used to investigate the 
various possible tradeoffs amongst these resource requirements. 
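
The abstract does not spell out the cost formula. The sketch below assumes the standard BSP charge for a superstep, w + h·g + l, where w is the maximum local computation, h the maximum number of messages sent or received by any processor, g the communication cost per message, and l the synchronisation cost. The class and function names are hypothetical, and the "machine signature" is reduced here to the pair (g, l).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineSignature:
    """Assumed BSP machine parameters, in units of local operations."""
    g: float  # communication cost per message (per-word gap)
    l: float  # cost of one barrier synchronisation


def superstep_cost(w, h, machine):
    """Cost of one superstep: local work + communication + synchronisation."""
    return w + h * machine.g + machine.l


def program_cost(supersteps, machine):
    """Total cost of a sequence of (w, h) supersteps on the given machine."""
    return sum(superstep_cost(w, h, machine) for w, h in supersteps)
```

Evaluating the same list of supersteps under different signatures gives a first estimate of how an algorithm's computation, communication and synchronisation trade off across architectures.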
Stefan Näher and I started the work on LEDA [?] in the spring of 1989.
Many colleagues and students have contributed to the project since then. A first
publication appeared in the fall of the same year. The LEDA book
will appear in the fall of 1999 and should be available at WAE’99. In my talk I
will discuss how the work on LEDA has changed my research perspective.
Abstract. A new algorithm to compute the K shortest paths (in order
of increasing length) between a given pair of nodes in a digraph with 
n nodes and m arcs is presented. The algorithm recursively and effi- 
ciently solves a set of equations which generalize the Bellman equations 
for the (single) shortest path problem and allows a straightforward imple- 
mentation. After the shortest path from the initial node to every other 
node has been computed, the algorithm finds the K shortest paths in 
$O(m + Kn \log(m/n))$ time. Experimental results presented in this paper
show that the algorithm outperforms in practice the algorithms by Eppstein
and by Martins and Santos [?] for different kinds of randomly
generated graphs.



1 Introduction 



The problem of enumerating, in order of increasing length, the K shortest paths 
between two given nodes, s and t, in a digraph G = (V, E) with n nodes and m 
arcs has received considerable attention.

The asymptotically fastest known algorithm to solve this problem is due
to Eppstein [?]. After computing the shortest path from every node in the
graph to t, the algorithm builds a graph representing all possible deviations
from the shortest path. Building this graph takes $O(m + n \log n)$ time in the basic
version of the algorithm, and $O(m + n)$ time in a more elaborate but “rather
complicated” version. Once this deviations graph has been built, the K
shortest paths can be obtained in order of increasing length, taking $O(\log k)$
time to compute the kth shortest path, so that the total time required to find
the K shortest paths after computing the shortest path from every node to t
is $O(m + n + K \log K)$. (Eppstein also solves in [?] the related problem of
computing the unordered set of the K shortest paths in $O(m + n + K)$ time.)

Martins’ path-deletion algorithm [] constructs a sequence of growing graphs
$G_1, G_2, \ldots, G_K$, such that the (first) shortest path in $G_k$ is the kth shortest path
in G. In [], Azevedo et al. avoid the execution of a shortest path algorithm on
each graph $G_k$, for $k > 1$, by properly using information already computed for
$G_{k-1}$. In [], a further computational improvement is proposed that reduces
the number of nodes for which new computations are performed. More recently,
Martins and Santos [] have proposed a new improvement that prevents the number
of arcs from growing when building $G_k$ from $G_{k-1}$, thus reducing the space
complexity. The total time required by the resulting algorithm to find the K
shortest paths in order of increasing length after computing the shortest path
from s to every node is $O(Km)$. However, in experiments reported by Martins
and Santos, their algorithm outperforms Eppstein’s algorithm [] in practice.

* This work has been partially supported by Spanish CICYT under contract
TIC-97-0745-C02.

In this paper, a new algorithm that finds the K shortest s-t paths is proposed.
The so-called Recursive Enumeration Algorithm (REA), which is described in
detail in Section 3, efficiently solves a set of equations that generalize the Bellman
equations for the (single) shortest path problem and allows a straightforward
implementation. The algorithm recursively computes every new s-t path by
visiting at most the nodes in the previous s-t path, and using a heap of candidate
paths associated to each node from which the next path from s to the node is
selected. In the worst case all these heaps are initialized in $O(m)$ time. Once these
heaps are initialized, for $k > 1$ the kth shortest path is obtained in $O(\lambda_{k-1} \log d)$
time, where $\lambda_{k-1}$ is the minimum between n and the number of nodes in the
$(k-1)$th shortest path and d is the maximum input degree of the graph. The
total time required to find the K shortest paths in order of increasing length
after computing the shortest path from s to every node is $O(m + Kn \log(m/n))$.
However, this worst case bound corresponds to a rather exceptional situation
in which the one-to-all K shortest paths need to be computed. Experimental
results, reported in Section 4, show that the REA outperforms in practice the
algorithms by Eppstein and by Martins and Santos for different kinds
of randomly generated graphs.

2 Problem Formulation 

Let G = (V, E) be a directed graph, where V is the set of nodes and $E \subseteq V \times V$
is the set of arcs. Let n be the number of nodes and m the number of arcs. Let
$\ell : E \to \mathbb{R}$ be a weighting function on the arcs. The value of $\ell(u, v)$ will be
called the length of (u, v). The set of nodes $u \in V$ for which there is an arc (u, v) in
E will be denoted by $\Gamma^{-1}(v)$. The nodes in $\Gamma^{-1}(v)$ are called predecessors of v.
The number of nodes in $\Gamma^{-1}(v)$ is known as the input degree of v and will be
denoted by $|\Gamma^{-1}(v)|$.

Given $u, v \in V$, a path from u to v is a sequence $\pi = \pi_1 \cdot \pi_2 \cdot \ldots \cdot \pi_{|\pi|} \in V^+$,
where $\pi_1 = u$, $\pi_{|\pi|} = v$, and $(\pi_i, \pi_{i+1}) \in E$, for $1 \le i < |\pi|$ (the notation $|\cdot|$ will
be used both for the number of nodes in a path and the cardinality of a set).
The length of a path $\pi$ is $L(\pi) = \sum_{1 \le i < |\pi|} \ell(\pi_i, \pi_{i+1})$, with $L(\pi) = 0$ if $|\pi| = 1$.
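
The definitions above can be made concrete with a small sketch. The dictionary-of-arcs representation and the function names `predecessors` and `path_length` are illustrative assumptions, not part of the paper.

```python
from typing import Dict, List, Tuple

Arc = Tuple[str, str]


def predecessors(arcs: Dict[Arc, float], v: str) -> List[str]:
    """Return Gamma^{-1}(v): every node u for which the arc (u, v) exists."""
    return sorted(u for (u, w) in arcs if w == v)


def path_length(arcs: Dict[Arc, float], path: List[str]) -> float:
    """Sum the arc lengths along `path`; a single-node path has length 0."""
    return sum(arcs[(a, b)] for a, b in zip(path, path[1:]))


# Hypothetical example graph with a self-loop on node "a".
arcs = {("s", "a"): 2.0, ("a", "t"): 3.0, ("s", "t"): 7.0, ("a", "a"): 1.0}
```

Because paths are node sequences rather than node sets, a path such as s, a, a, t (using the self-loop once) is a valid path and is longer than s, a, t by the length of the loop.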

Let us consider the problem of computing the K shortest paths from a starting
node $s \in V$ to a terminal node $t \in V$ in order of increasing length. For
every $v \in V$, the kth shortest path from s to v will be denoted by $\pi^k(v)$, and
$L^k(v) = L(\pi^k(v))$ will denote its length. Self-loops, cycles and positive and
negative arc-lengths are allowed, but it will be assumed that G does not contain
negative length cycles which can be reached from s (in which case the problem
would have no solution). For the sake of notational simplicity we will present the
new algorithm in the next section assuming that there are no parallel arcs between
the same pair of nodes, although it should be noted that this is not a requirement
of the algorithm.



3 Recursive Enumeration Algorithm 



3.1 Derivation and Correctness 



The following theorem formulates the computation of the K shortest paths from
s to v as the resolution of a set of recursive equations, which for k = 1 are the
well-known Bellman equations for the (single) shortest path problem.



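Theorem 1 For all $v \in V$,

$$L^k(v) = \begin{cases} 0, & \text{if } k = 1 \text{ and } v = s; \\ \min_{\pi \in C^k(v)} L(\pi), & \text{otherwise;} \end{cases} \qquad \text{(1a)}$$

$$\pi^k(v) = \begin{cases} s, & \text{if } k = 1 \text{ and } v = s; \\ \arg\min_{\pi \in C^k(v)} L(\pi), & \text{otherwise;} \end{cases} \qquad \text{(1b)}$$

where, if $k = 1$ and $v \neq s$, or $k = 2$ and $v = s$, then

$$C^k(v) = \{\pi^1(u) \cdot v : u \in \Gamma^{-1}(v)\}; \qquad \text{(1c)}$$

otherwise, if u and k' are the node and index, respectively, such that
$\pi^{k-1}(v) = \pi^{k'}(u) \cdot v$, then

$$C^k(v) = \left(C^{k-1}(v) - \{\pi^{k'}(u) \cdot v\}\right) \cup \{\pi^{k'+1}(u) \cdot v\}, \qquad \text{(1d)}$$

assuming that $\{\pi^{k'+1}(u) \cdot v\}$ denotes the empty set if $\pi^{k'+1}(u)$ does not exist,
which happens when $C^{k'+1}(u)$ is empty.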



Proof: Let $\mathcal{P}^k(v)$ denote the set of the k shortest paths from s to v. Each path
in $\mathcal{P}^k(v)$ reaches v from some node $u \in \Gamma^{-1}(v)$. In order to compute $\pi^k(v)$ we
could consider, for every $u \in \Gamma^{-1}(v)$, all paths from s to u that do not lead to a
path in $\mathcal{P}^{k-1}(v)$. However, considering that $k_1 < k_2$ implies
$L(\pi^{k_1}(u)) + \ell(u, v) \le L(\pi^{k_2}(u)) + \ell(u, v)$,
only the shortest of these paths needs to be taken into account
when computing $\pi^k(v)$. Thus, we can associate to $\pi^k(v)$ a set of candidate paths
$C^k(v)$ among which $\pi^k(v)$ can be chosen, that contains at most one path from
each predecessor node $u \in \Gamma^{-1}(v)$ and is recursively defined by equations
(1c) and (1d).

□
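
A small worked instance may help to see how the equations of Theorem 1 unfold. The four-node graph below is an illustrative assumption, not an example from the paper: arcs (s, a), (s, b), (a, t), (b, t) with lengths 1, 2, 1, 1, so that $L^1(a) = 1$ and $L^1(b) = 2$.

```latex
\begin{align*}
C^1(t) &= \{\pi^1(a)\cdot t,\ \pi^1(b)\cdot t\}
  &&\Rightarrow\ \pi^1(t) = s\cdot a\cdot t,\ L^1(t) = 2 \\
C^2(s) &= \emptyset
  &&\Rightarrow\ \pi^2(s)\ \text{does not exist} \\
C^2(a) &= \bigl(C^1(a) - \{\pi^1(s)\cdot a\}\bigr) \cup \emptyset = \emptyset
  &&\Rightarrow\ \pi^2(a)\ \text{does not exist} \\
C^2(t) &= \bigl(C^1(t) - \{\pi^1(a)\cdot t\}\bigr) \cup \emptyset = \{\pi^1(b)\cdot t\}
  &&\Rightarrow\ \pi^2(t) = s\cdot b\cdot t,\ L^2(t) = 3
\end{align*}
```

Here $C^2(s)$ is empty by (1c) because no arc enters s, and the same reasoning applied once more gives $C^3(t) = \emptyset$, so this graph has exactly two s-t paths.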



An alternative but equivalent formulation of these equations was initially
given by Dreyfus, who also proposed an iterative algorithm for solving them
in which, for every $k \ge 2$, the kth shortest path from s is computed node by node
by first computing it at nodes that are one arc distant from s in the shortest
path, then two arcs, etc. Fox [] pointed out the convenience of using heaps
in Dreyfus’ algorithm. In this way, the K shortest paths from s to every node
in the graph are computed in $O(m + Kn \log(m/n))$ time once the shortest path
from s to every node has been computed. However, this algorithm computes the
K shortest paths from s to every node in the graph, even when we are interested
only in paths between s and t.

The following algorithm, which we will refer to as the Recursive Enumeration
Algorithm (REA), computes the K shortest paths from s to t, and constitutes a
direct recursive resolution of equations (1a)-(1d) for $2 \le k \le K$ and $v = t$, once
the shortest path from s to every other node in the graph has been computed
in Step A.1:



A.1 Compute $\pi^1(v)$ for all $v \in V$ by means of any adequate one-to-all shortest
path algorithm and set $k \leftarrow 1$.

A.2 Repeat until $\pi^k(t)$ does not exist or no more paths are needed:

A.2.1 Set $k \leftarrow k + 1$ and compute $\pi^k(t)$ by calling NextPath(t, k).

For $k > 1$, and once $\pi^1(v), \pi^2(v), \ldots, \pi^{k-1}(v)$ are available, NextPath(v, k)
computes $\pi^k(v)$ as follows:

B.1 If k = 2, then initialize a set of candidates to the next shortest path from s
to v by setting $C[v] \leftarrow \{\pi^1(u) \cdot v : u \in \Gamma^{-1}(v) \text{ and } \pi^1(v) \neq \pi^1(u) \cdot v\}$.

B.2 If v = s and k = 2, then go to B.6.

B.3 Let u and k' be the node and index, respectively, such that
$\pi^{k-1}(v) = \pi^{k'}(u) \cdot v$.

B.4 If $\pi^{k'+1}(u)$ has not already been computed, then compute it by calling
NextPath(u, k' + 1).

B.5 If $\pi^{k'+1}(u)$ exists, then insert $\pi^{k'+1}(u) \cdot v$ in C[v].

B.6 If $C[v] \neq \emptyset$, then select and delete the path with minimum length from C[v]
and assign it to $\pi^k(v)$, else $\pi^k(v)$ does not exist.
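
Steps A.1-A.2 and B.1-B.6 can be sketched compactly in Python. This is an illustrative sketch under stated assumptions, not the authors' C implementation: it uses Dijkstra's algorithm for Step A.1 (so arc lengths must be non-negative here), assumes no parallel arcs, uses Python recursion (so very long paths may hit the recursion limit), and adds an `exhausted` set so that a node whose next path is known not to exist is not revisited. The names `k_shortest_paths`, `next_path`, `cand` and `exhausted` are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Arcs = Dict[Tuple[str, str], float]


def k_shortest_paths(arcs: Arcs, s: str, t: str, K: int) -> List[Tuple[float, List[str]]]:
    """Return up to K shortest s-t paths as (length, node list), shortest first."""
    preds: Dict[str, List[Tuple[str, float]]] = {}
    succs: Dict[str, List[Tuple[str, float]]] = {}
    for (u, v), w in arcs.items():
        preds.setdefault(v, []).append((u, w))
        succs.setdefault(u, []).append((v, w))

    # Step A.1 (assumption: Dijkstra, hence non-negative lengths in this sketch).
    # paths[v][k-1] = (L^k(v), predecessor u, 0-based index of the path of u it extends).
    paths: Dict[str, list] = {}
    queue = [(0, s, s)]
    while queue:
        d, v, u = heapq.heappop(queue)
        if v in paths:
            continue
        paths[v] = [(d, None, None) if v == s else (d, u, 0)]
        for x, w in succs.get(v, []):
            if x not in paths:
                heapq.heappush(queue, (d + w, x, v))
    if t not in paths:
        return []

    cand: Dict[str, list] = {}  # C[v], kept as a binary heap of (length, u, index)
    exhausted = set()           # nodes whose next path is known not to exist

    def next_path(v: str, k: int) -> None:
        if k == 2:  # Step B.1: one candidate per predecessor except the tree arc
            tree_pred = paths[v][0][1]
            cand[v] = [(paths[u][0][0] + w, u, 0)
                       for u, w in preds.get(v, [])
                       if u in paths and u != tree_pred]
            heapq.heapify(cand[v])
        if not (v == s and k == 2):  # Step B.2
            _, u, i = paths[v][k - 2]  # Step B.3: previous path of v extends path i of u
            if len(paths[u]) == i + 1 and u not in exhausted:  # Step B.4
                next_path(u, i + 2)
            if len(paths[u]) > i + 1:  # Step B.5
                heapq.heappush(cand[v], (paths[u][i + 1][0] + arcs[(u, v)], u, i + 1))
        if cand[v]:  # Step B.6
            paths[v].append(heapq.heappop(cand[v]))
        else:
            exhausted.add(v)

    k = 1
    while k < K and t not in exhausted:  # Step A.2
        k += 1
        next_path(t, k)

    def expand(v, i):
        """Follow the back-pointers from path i of node v to recover the node list."""
        nodes = []
        while v is not None:
            nodes.append(v)
            _, v, i = paths[v][i]
        return nodes[::-1]

    return [(paths[t][i][0], expand(t, i)) for i in range(min(K, len(paths[t])))]


# Hypothetical example with a cycle between a and b.
example = {("s", "a"): 1, ("a", "t"): 1, ("s", "b"): 3,
           ("b", "t"): 4, ("a", "b"): 1, ("b", "a"): 5}
```

On the example graph the enumeration includes paths that traverse the a-b cycle, since the problem allows paths with repeated nodes.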



In the case that different paths from s to v with the same length exist, any
of them can be chosen first, but we assume (without loss of generality) that if
Step A.1 obtains $\pi^1(v_j) = v_1 \cdot v_2 \cdot \ldots \cdot v_j$, then it obtains
$\pi^1(v_i) = v_1 \cdot v_2 \cdot \ldots \cdot v_i$
for $1 \le i < j$. For graphs with zero length cycles, this implies that the particular
$\pi^1(v)$ computed in Step A.1 does not contain cycles.

The enumeration of paths by loop A. 2 can be finished once the K shortest 
paths from s to t have been obtained. However, the number of paths to be 
computed does not really need to be fixed a priori, and the algorithm can be 
used in general to enumerate the shortest paths until the first one satisfying 
some desired restriction is obtained. 

Note that C[v] in the algorithm corresponds to $C^k(v)$ in Theorem 1. Steps B.1-B.5
compute $C^k(v)$ from $C^{k-1}(v)$, according to equations (1c) and (1d) (Step B.1
initializes C[v] when the second shortest path from s to v is required). In Step B.6,
$\pi^k(v)$ is selected from $C^k(v)$ according to (1b).

The following theorems prove that the recursive procedure terminates and
indicate interesting properties of the algorithm. Theorem 2 proves that the
recursive calls to NextPath to compute $\pi^k(t)$ visit, in the worst case, all the nodes
in $\pi^{k-1}(t)$. Theorem 3 proves that, in the case that $\pi^{k-1}(t)$ contains a loop, the
recursive calls finish before visiting the same node twice.

Theorem 2 For $k > 1$ and for all $v \in V$, the computation of $\pi^k(v)$ by means of
NextPath(v, k) may recursively generate calls to NextPath(u, j) only for nodes
u in $\pi^{k-1}(v)$.

Proof: Let us suppose that $\pi^{k-1}(v) = u_1 u_2 \ldots u_p$ (where $u_1 = s$ and $u_p = v$).
For every $i = 1, \ldots, p$, let $k_i$ be the index such that $\pi^{k_i}(u_i) = u_1 u_2 \ldots u_i$.
Since $\pi^{k-1}(v) = \pi^{k_{p-1}}(u_{p-1}) \cdot v$, NextPath(v, k) may require a recursive call to
NextPath($u_{p-1}$, $k_{p-1} + 1$) in case $\pi^{k_{p-1}+1}(u_{p-1})$ has not been already computed;
since $\pi^{k_{p-1}}(u_{p-1}) = \pi^{k_{p-2}}(u_{p-2}) \cdot u_{p-1}$, NextPath($u_{p-1}$, $k_{p-1} + 1$) may require
a recursive call to NextPath($u_{p-2}$, $k_{p-2} + 1$); and so on. In the worst case, the
recursive calls extend through the nodes $u_p, u_{p-1}, \ldots, u_1$. Since $u_1 = s = \pi^1(s)$,
if the recursion reaches $u_1$, that is, if NextPath(s, 2) is called to compute $\pi^2(s)$,
then the condition in Step B.2 holds and no more recursive calls are performed.

□



Theorem 3 For $k > 1$ and for all $v \in V$, computing $\pi^k(v)$ by NextPath(v, k)
does not generate a recursive call to NextPath(v, j) for any j.

Proof: Let us suppose that $\pi^{k-1}(v) = u_1 u_2 \ldots u_p$ (where $u_1 = s$ and $u_p = v$)
contains $u_i = u_p$, $i < p$. Due to the condition in Step B.4, recursive calls to
NextPath through the nodes of $\pi^{k-1}(v)$ can reach $u_{i+1}$ only if $u_1 u_2 \ldots u_{i+1}$ is
the last computed path ending at $u_{i+1}$. In that case, the index k' in Step B.3
is the position of the path $u_1 u_2 \ldots u_i$ in the list of shortest paths from s to $u_i$,
since it is obvious that $u_1 u_2 \ldots u_{i+1} = (u_1 u_2 \ldots u_i) \cdot u_{i+1}$. Since at least the
path $u_1 u_2 \ldots u_i \ldots u_p$ ending at $u_p$ (where $u_p = u_i$) has already been generated,
Step B.4 detects that $\pi^{k'+1}(u_i)$ has already been computed and NextPath is not
invoked at $u_i$. □



3.2 Data Structures and Computational Complexity 

Representation of Paths. The paths $\pi^1(v), \pi^2(v), \ldots$ ending at v can be dynamically
stored (by increasing length) in Step B.6 in a linked list connected to v;
on the other hand, each path $\pi^k(v) = \pi^{k'}(u) \cdot v$ can be compactly represented by
its length $L^k(v)$ and a back-pointer to the path $\pi^{k'}(u)$ in the predecessor node
u (see Fig. 1). In this way, each one of the K shortest paths from s to t can
be recovered at the end of the algorithm in time proportional to the number of
nodes in the path, following the back-pointers from the list associated with node
t, as is usual in (single) shortest path algorithms (see for example []).
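
The back-pointer representation can be sketched as follows. The `PathRecord` class and the `recover` function are hypothetical names chosen for illustration; the point is only that each path stores its length and a reference to the record of its prefix at the predecessor node.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PathRecord:
    """One path ending at `node`: its length plus a back-pointer to its prefix."""
    node: str
    length: float
    back: Optional["PathRecord"] = None  # None only for the trivial path at s


def recover(record: PathRecord) -> List[str]:
    """Walk the back-pointers; time is proportional to the number of nodes."""
    nodes = []
    cur: Optional[PathRecord] = record
    while cur is not None:
        nodes.append(cur.node)
        cur = cur.back
    return nodes[::-1]


# Hypothetical records: two paths ending at t share the prefix stored at w.
p_s = PathRecord("s", 0.0)
p_w = PathRecord("w", 2.0, p_s)
p_t1 = PathRecord("t", 5.0, p_w)
p_v = PathRecord("v", 4.0, p_w)
p_t2 = PathRecord("t", 6.0, p_v)
```

Since prefixes are shared rather than copied, storing a new path costs constant space regardless of how many nodes it contains.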

The following example, depicted in Fig. 1, illustrates how NextPath would
operate with this representation. Let us assume that $\Gamma^{-1}(t) = \{u, v, w\}$ and
$\pi^3(t) = \pi^1(v) \cdot t$ was chosen as the best path in
$C[t] = C^3(t) = \{\pi^1(u) \cdot t, \pi^1(v) \cdot t, \pi^3(w) \cdot t\}$.
When NextPath is called in order to compute $\pi^4(t)$, Step B.3
reaches $\pi^1(v)$ in time O(1) through the back-pointer in $\pi^3(t)$ and Step B.4 checks
in time O(1) whether $\pi^2(v)$ has been previously computed by looking for the
path following $\pi^1(v)$ in the list of paths associated with v. If it hasn’t, NextPath
is called to compute it. When this recursive call ends, and if we assume that
$\pi^2(v)$ exists, Step B.5 inserts $\pi^2(v) \cdot t$ (scored by $L^2(v) + \ell(v, t)$) into the set
of candidates associated with t. Then, Step B.6 selects the best candidate from
$C[t] = C^4(t) = \{\pi^1(u) \cdot t, \pi^2(v) \cdot t, \pi^3(w) \cdot t\}$ and links it following $\pi^3(t)$ in the
list of paths associated with t.

Fig. 1. Representation of paths and sets of candidates in the REA. Thick lines
represent nodes and arcs in G. The 3 shortest paths reaching node t have been
computed: $\pi^1(t) = \pi^1(w) \cdot t$, $\pi^2(t) = \pi^2(w) \cdot t$, and $\pi^3(t) = \pi^1(v) \cdot t$. These paths
are arranged in a linked list hanging from node t. Each path is represented by a
back-pointer to another path in a predecessor node. The set of candidates C[t]
contains one element for each incoming arc, pointing to the last considered path
in the corresponding predecessor node.

Representation of Sets of Candidates. Several alternative data structures can
be used in practice to efficiently handle the sets of candidates. Given that each
node keeps one candidate for each incoming arc, the total space required by all
the sets of candidates is $O(m)$. In each call to NextPath, at most one candidate
is inserted in Step B.5, and one candidate is extracted in Step B.6, so that the
total size of the set of candidates remains constant or decreases by one.

1. If the sets of candidates are implemented with heaps, then Step B.5 requires
time $O(\log |\Gamma^{-1}(v)|)$, and Step B.6 requires time O(1) by postponing the
deletion of the minimum element in the heap until a new insertion is
performed by Step B.5 on the same heap. Step B.1 can be performed in time
$O(|\Gamma^{-1}(v)|)$ with heap-build [].

2. If the sets of candidates are kept unsorted, then Step B.5 requires O(1) time
and Step B.6 requires $O(|\Gamma^{-1}(v)|)$ time, for every call to NextPath. This
is asymptotically worse than using heaps. However, it is well known that in
practice an unsorted array can be more efficient than a heap in certain cases.

3. To benefit from the asymptotic behavior of heaps but avoid a costly
initialization that may not be amortized unless a large number of paths is computed,
a hybrid data structure can be used in which the set of candidates is
implemented as an array that is in part unsorted and in part structured as a
heap. Initially, the full array is unsorted; whenever a candidate in the array
is selected, the next path obtained from it is inserted in the heap part. The
best candidate in Step B.6 must be chosen between the best candidate in the
heap and the best candidate in the rest of the array (which is computed only
when it is unknown). Experimental results that are reported in Section 4
show the convenience of this approach in some cases.
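
The hybrid structure of item 3 might be sketched as below. This is an assumption-laden illustration rather than the authors' code: it keeps the two parts in separate Python lists instead of one array, and the class name `HybridCandidates` and its methods are hypothetical.

```python
import heapq
from typing import List, Optional, Tuple

Candidate = Tuple[float, str]  # (length, predecessor); an illustrative layout


class HybridCandidates:
    """Candidate set that is part unsorted list, part binary heap."""

    def __init__(self, initial: List[Candidate]) -> None:
        self.rest: List[Candidate] = list(initial)  # unsorted part, no heap-build
        self.heap: List[Candidate] = []             # heap part, grows on insertions
        self.best_rest: Optional[int] = None        # index of min of `rest`, if known

    def insert(self, item: Candidate) -> None:
        """Step B.5: new candidates always go into the heap part."""
        heapq.heappush(self.heap, item)

    def pop_min(self) -> Optional[Candidate]:
        """Step B.6: best of the heap top and the best unsorted candidate."""
        if self.best_rest is None and self.rest:
            # Linear scan, done only when the minimum of `rest` is unknown.
            self.best_rest = min(range(len(self.rest)), key=self.rest.__getitem__)
        take_rest = bool(self.rest) and (
            not self.heap or self.rest[self.best_rest] <= self.heap[0])
        if take_rest:
            item = self.rest[self.best_rest]
            self.rest[self.best_rest] = self.rest[-1]  # swap-remove in O(1)
            self.rest.pop()
            self.best_rest = None
            return item
        if self.heap:
            return heapq.heappop(self.heap)
        return None
```

The cached index stays valid across extractions that come from the heap part, so the linear scan is repeated only after an unsorted candidate has actually been consumed.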

Computational Complexity. Let us analyze the computation time required by
the algorithm (once the shortest paths tree is available) in case that the sets of
candidates are implemented with heaps. Since Step B.1 is done at most once
for each node in the graph, in the worst case all the sets of candidates are
initialized in total time $O(m)$. The recursive calls to NextPath to compute $\pi^k(t)$
go through the nodes of $\pi^{k-1}(t)$ (Theorem 2), and never visit the same node twice
(Theorem 3). Hence the number of recursive calls needed to compute $\pi^k(t)$ is
upper bounded by $\lambda_{k-1} = \min(n, |\pi^{k-1}(t)|)$. At any given time, the set C[v]
contains at most one candidate for each predecessor node of v and the time
required by Steps B.2-B.6, when executed on node v, is $O(\log |\Gamma^{-1}(v)|)$. In the
worst case, $\lambda_{k-1} = n$ and then the total running time of Steps B.2-B.6 in the
recursive calls to compute $\pi^k(t)$ is $O(n \log(m/n))$, since $\sum_{v \in V} |\Gamma^{-1}(v)| = m$
and $\sum_{v \in V} \log |\Gamma^{-1}(v)|$ is maximized when all the terms are of equal size. Hence,
the total time required by the REA to find the K shortest paths in order of
increasing length after computing the shortest path from s to every node is
$O(m + Kn \log(m/n))$.
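
The claim that the sum of logarithms is maximized by equal terms follows from the concavity of the logarithm (Jensen's inequality). As a sketch, writing $d_v$ as shorthand (introduced here) for $|\Gamma^{-1}(v)|$ and assuming $d_v \ge 1$ for the nodes involved:

```latex
\sum_{v \in V} \log d_v
  \;=\; n \cdot \frac{1}{n} \sum_{v \in V} \log d_v
  \;\le\; n \log\!\Big( \frac{1}{n} \sum_{v \in V} d_v \Big)
  \;=\; n \log \frac{m}{n}
```

Equality holds exactly when every $d_v$ equals $m/n$, which is the equal-size case mentioned in the text.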

It is worth emphasizing that, as a consequence of Theorems 2 and 3, the
number of recursive calls generated by NextPath(t, k) is at most $\min(n, |\pi^{k-1}(t)|)$.
Therefore, once the heaps are initialized, for $k > 1$ the kth shortest path is
obtained in $O(\lambda_{k-1} \log d)$ time, where d is the maximum input degree of the graph.
The number of recursive calls could even be less than $\lambda_{k-1}$ if at some
intermediate node the next path has already been computed, in which case
the recursion ends. The worst case bound $O(m + Kn \log(m/n))$ corresponds in
the REA to a rather exceptional situation in which NextPath(t, k) generates n
recursive calls for all $k = 2, \ldots, K$. This can only happen if all the $K - 1$ shortest
s-t paths have at least n nodes. However, in many practical situations the
shortest paths are composed of a small fraction of the nodes in the graph, in which
case the time required by the REA to compute the kth shortest path may be
negligible. An experimental study, described in the next section, has been done
in order to assess the efficiency of the REA in practice and to compare it with
alternative algorithms.

The algorithm can also be extended to find the K shortest paths from s to
every node in a set $T \subseteq V$ by just calling NextPath(t, k) for each $t \in T$ and for
each $k = 2, \ldots, K$, if $\pi^k(t)$ has not been previously computed. In this way, instead
of starting the computation from scratch for each t, the previous computations
are reused (NextPath will also generate recursive calls only if the required paths
have not been previously computed). The worst case time complexity of this
algorithm is still $O(m + Kn \log(m/n))$. In the particular case T = V it constitutes
an alternative to the algorithm by Dreyfus and Fox for the one-to-all K shortest
paths computation, with the same time complexity but differing from it in that
the nodes do not need to be processed in any particular order. Also, the number
of paths to be computed can be different for each node and can be decided
dynamically, using the algorithm to obtain alternative paths only when they are
needed.

If K is known a priori, a further computational improvement is possible: since
the algorithm will select at most the K best paths from C[v], in nodes v in which
$K < |\Gamma^{-1}(v)|$ the size of the heap that represents C[v] can be limited to K, so
that Step B.5 requires time $O(\log K)$ instead of $O(\log |\Gamma^{-1}(v)|)$ and the worst
case time complexity of the algorithm becomes $O(m + Kn \log(\min(K, m/n)))$.



4 Experimental Comparison 

In this section, experimental results are reported comparing, on three different
kinds of randomly generated graphs, the time required to compute the K shortest
s-t paths by Martins and Santos’ algorithm [] (MSA), the basic (more
practical) version of Eppstein’s algorithm (EA), and the Recursive Enumeration
Algorithm with two different representations of the sets of candidates:
heaps (REA) and the hybrid data structures (in part heaps and in part unsorted)
described in the previous section (HREA).



Implementation of the Algorithms. We have used the implementation of MSA
made publicly available by Martins (http://www.mat.uc.pt/~eqvm). EA, REA
and HREA have been implemented by the authors of this work and are also
publicly available (ftp://terra.act.uji.es/pub/REA). All the programs have been
implemented in C.



Experimental Methodology and Computing Environment. Since MSA uses a differ- 
ent implementation of Dijkstra’s algorithm and we are interested in comparing 
the performance of the algorithms once the shortest path tree has been com- 
puted, all time measurements start when Dijkstra’s algorithm ends. All the im- 
plementations share the routines used to measure the CPU running time (with a 
resolution of 0.01 seconds). Each point in the curves shows the average execution 
time for 15 random graphs generated with the same parameters, but different 
random seeds. In all cases the greatest arc length was chosen to be 10000. 

The programs have been compiled with the GNU C compiler (version 2.7.2.3)
using the maximum optimization level. The experiments have been performed
on a 300 MHz Pentium-II computer with 256 megabytes of RAM, running under
Linux 2.0.

[Figure 2, four panels: (a) n = 1000, m = 1000000, d = 1000; (b) n = 1000,
m = 1000000, d = 1000; (c) n = 1000, m = 100000, d = 100; (d) n = 5000,
m = 20000, d = 4.]

Fig. 2. Experimental results for graphs generated with Martins’ general
instances generator. CPU time as a function of the number of computed paths. (a)
is an enlargement of the initial region of (b) to appreciate more clearly the
differences between the algorithms for small values of K. (Parameters: n, number
of nodes; m, number of arcs; d = m/n, average input degree.)

[Figure 3, four panels, horizontal axis: number of paths: (a) S = 1000,
$\eta$ = 100, d = 10; (b) S = 1000, $\eta$ = 100, d = 10; (c) S = 100, $\eta$ = 100,
d = 10; (d) S = 100, $\eta$ = 100, d = 100.]

Fig. 3. Experimental results for multistage graphs. CPU time as a function of
the number of computed paths. (a) is an enlargement of the initial region of
(b). (Parameters: S, number of stages; $\eta$, number of nodes per stage; d, input
degree.)

[Figure 4, four panels, horizontal axis: number of stages in (a) and (b),
input degree in (c) and (d): (a) $\eta$ = 100, d = 10, K = 1000; (b) $\eta$ = 100,
d = 10, K = 10000; (c) S = 100, $\eta$ = 100, K = 1000; (d) S = 100, $\eta$ = 100,
K = 10000.]

Fig. 4. Experimental results for multistage graphs. CPU time as a function of
the number of stages (a and b) and the input degree (c and d). (Parameters: S,
number of stages; $\eta$, number of nodes per stage; d, input degree; K, number of
computed paths.)

4.1 Results for Graphs Generated with Martins’ General Instances
Generator

First, we compared the algorithms using Martins’ general instances generator
(geraK), the same used in the experiments reported in [] (and also made
publicly available by Martins at http://www.mat.uc.pt/~eqvm). The input to
this graph generator consists of four values: seed for the random number generator,
number of nodes, number of arcs, and maximum arc length. The program creates
the specified number of nodes and joins them with a Hamiltonian cycle to ensure
that the start and terminal nodes are connected; then, it completes the set of arcs
by randomly choosing pairs of nodes. The arc lengths are generated uniformly
distributed in the specified range.
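
A generator in the spirit of this description could look as follows. This is not geraK itself: the function name `random_graph`, the use of integer lengths in [0, max_len], the random order of the Hamiltonian cycle, and the choice to allow self-loops while rejecting parallel arcs are all assumptions made for the sketch.

```python
import random
from typing import Dict, Tuple


def random_graph(seed: int, n: int, m: int, max_len: int) -> Dict[Tuple[int, int], int]:
    """Random digraph on nodes 0..n-1: a Hamiltonian cycle plus random extra arcs."""
    if n < 2 or not (n <= m <= n * n):
        raise ValueError("need n >= 2 and n <= m <= n*n")
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    arcs: Dict[Tuple[int, int], int] = {}
    # Hamiltonian cycle: guarantees that every node can reach every other node.
    for i in range(n):
        arcs[(order[i], order[(i + 1) % n])] = rng.randint(0, max_len)
    # Complete the arc set with random pairs (rejection sampling, no parallel arcs).
    while len(arcs) < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if (u, v) not in arcs:
            arcs[(u, v)] = rng.randint(0, max_len)
    return arcs
```

Rejection sampling becomes slow when m approaches n*n; for very dense instances, enumerating and shuffling the missing pairs would be a more suitable choice.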

Fig. 2 represents the CPU time required to compute up to 100, up to 10000,
and up to 100000 paths by each of the algorithms with this kind of graph, for
different values of the number of nodes and number of arcs.

Comparing the results of REA and HREA, it can be observed that heaps (REA) 
are preferable for small values of the average input degree, while the hybrid sets 
of candidates (HREA) are preferable for large values of the average input degree 
(when initializing the heaps is more costly) . The results using totally unsorted 
sets of candidates were worse than the results with heaps in all the experiments 
and are not included in the figures for the sake of clarity. 

Regarding the behavior of Eppstein’s algorithm, it can be observed that, once 
the graph of deviation paths has been built, the K shortest paths are found very 
efficiently. However, building this graph requires (in comparison with the other 
algorithms) a considerable amount of time that can be clearly identified in the 
figures at the starting point K = 2. 

4.2 Results for Multistage Graphs 

Multistage graphs are of interest in many applications and underlie many discrete
Dynamic Programming problems. A multistage graph is a graph whose set of
nodes can be partitioned into S disjoint sets (stages), $V_1, V_2, \ldots, V_S$, such that
every arc in E joins a node in $V_i$ with a node in $V_{i+1}$, for some i such that
$1 \le i < S$. We have implemented a program that generates random multistage
graphs with the specified number of stages, number of nodes per stage, and
input degree (which is fixed to the same value for all the nodes). Arc lengths are
generated uniformly distributed between 0 and the specified maximum value.
The results with this kind of graph are presented in Figs. 3 and 4.
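
The multistage generator described above might be sketched like this. The function name `multistage_graph`, the (stage, index) node labels, integer lengths, and the choice of distinct predecessors for each node are assumptions; how the authors attach the start and terminal nodes is not stated in the text and is left out here.

```python
import random
from typing import Dict, Tuple

Node = Tuple[int, int]  # (stage, index within stage)


def multistage_graph(seed: int, stages: int, per_stage: int,
                     in_degree: int, max_len: int) -> Dict[Tuple[Node, Node], int]:
    """Random multistage graph: each node after stage 0 gets `in_degree` arcs
    from distinct nodes of the previous stage."""
    if in_degree > per_stage:
        raise ValueError("in_degree cannot exceed the number of nodes per stage")
    rng = random.Random(seed)
    arcs: Dict[Tuple[Node, Node], int] = {}
    for i in range(1, stages):
        for j in range(per_stage):
            for p in rng.sample(range(per_stage), in_degree):
                arcs[((i - 1, p), (i, j))] = rng.randint(0, max_len)
    return arcs
```

With this construction the number of arcs is (stages - 1) * per_stage * in_degree, and every path from the first stage to the last one has exactly one node per stage.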

Fig. 3 shows that the dependency on the number of paths is similar to what
has been observed for the graphs generated by geraK. However, in this case EA
is the fastest algorithm to compute K = 10000 paths when the input degree is
d = 10 and the number of stages is S = 1000 (Fig. 3b), while REA is the fastest
algorithm for smaller values of K (Fig. 3a), smaller values of S (Fig. 3c) or
larger values of d (Fig. 3d).

Figs. 4a and 4b represent the dependency of the running time on the number
of stages, to compute up to 1000 and up to 10000 paths, respectively. These
figures illustrate the evolution from Fig. 3c to Fig. 3b. In the case of MSA and REA,
the dependency on the number of stages is due to the fact that $\pi^k(t)$ is
computed by visiting, in the worst case, all the nodes in $\pi^{k-1}(t)$, and in multistage
graphs the number of nodes in any path is the number of stages.

Fig. 5. Results for Delaunay triangulation graphs. CPU time as a function of
(a) the number of computed paths (with 100000 nodes) and (b) the number of
nodes (to compute 10000 paths).

Figs. 4c and 4d represent the dependency of the running time on the input
degree to compute up to 1000 and up to 10000 paths, respectively. In these
figures the input degree ranges from 2 to 100, and this includes the evolution
from Fig. 3c to Fig. 3d, as well as the behavior for smaller values of the input
degree. Observe that the time required by REA and HREA only increases slightly
when the input degree increases, while MSA and EA are clearly more affected by
this parameter, so that the difference between both algorithms and REA increases
as the input degree does.



4.3 Results for Delaunay Triangulation Graphs 

Delaunay triangulations are a particular kind of graphs that can model com- 
munication networks. We have implemented a graph generator that uniformly 
distributes a given number of points in a square and then computes their De- 
launay triangulation and assigns to each arc the Euclidean distance between the 
two joined points (nodes). The initial and final nodes are located in opposite 
vertices of the square. 

The results can be observed in Fig. 5. In this case EA is far from being
competitive even to compute as many as 10000 paths and even though the average
degree in these graphs is typically lower than 10. It can be clearly observed how
the difference between EA and the other algorithms increases as the number of
nodes does. This can be explained by the fact that the K shortest paths in
these graphs tend to be close to the diagonal joining the opposite vertices of the
square. Thus, MSA and REA only need to compute alternative paths on the nodes
of this region.



5 Conclusions and Final Remarks 



Several algorithms have been proposed in the literature which very efficiently
compute the K shortest paths between two given nodes in a graph. Among
these, the algorithm proposed by Eppstein stands out because of its low
asymptotic complexity []. This algorithm includes an initial stage to build a graph of
path deviations from which the K shortest paths are then very efficiently
computed. However, the time required by the initial stage is not always worth paying,
as the experimental results in this paper illustrate. Martins and Santos [] have
presented a different algorithm that, under certain circumstances, runs faster in
practice. A new algorithm proposed in this paper, the Recursive Enumeration
Algorithm (REA), is another useful, practical alternative according to the
experimental results. Experiments with different kinds of randomly generated graphs
and for many different settings of the parameters determining the size of the
problem have shown the superiority of the new method in many practical
situations. The REA is especially well suited for graphs in which shortest paths are
composed of a small fraction of the nodes in the graph. On the other hand, the
REA can be derived from a set of equations that have been formally proved to
solve the problem, it relies on quite simple data structures, and it can be easily
implemented.

The REA has been extended to enumerate the N-best sentence hypotheses in
speech recognition []. This is a related problem that can be modeled
as the search for the K shortest paths in a multistage graph, but with the
additional requirement that only paths with different associated word sequences
are of interest. The REA, modified to discard at intermediate nodes those partial
paths whose associated word sequence has already been generated, has been
applied in a task involving the recognition of speech queries to a geographical
database in which the underlying multistage graph has (on average per utterance)
about 2 • 10® nodes and 450 stages and in a city-name spelling recognition 
task in which the underlying multistage graph has about 2 • 10® nodes and 272 
stages []. The REA has also been extended to compute the K best solutions
to general discrete Dynamic Programming problems [].

Abstract. We present an efficient implementation of a write-only top-
down construction for suffix trees. Our implementation is based on a 
new, space-efficient representation of suffix trees which requires only 12 
bytes per input character in the worst case, and 8.5 bytes per input 
character on average for a collection of files of different type. We show 
how to efficiently implement the lazy evaluation of suffix trees such that 
a subtree is evaluated not before it is traversed for the first time. Our 
experiments show that for the problem of searching many exact patterns 
in a fixed input string, the lazy top-down construction is often faster and 
more space efficient than other methods. 

1 Introduction 

Suffix trees are efficiency boosters in string processing. The suffix tree of a text
t is an index structure that can be computed and stored in $O(|t|)$ time and
space. Once constructed, it allows any substring w of t to be located in $O(|w|)$ steps,
independent of the size of t. This instant access to substrings is most convenient
in a “myriad” [] of situations, and in Gusfield’s recent book [], about 70 pages
are devoted to applications of suffix trees.

While suffix trees play a prominent role in algorithmics, their practical use 
has not been as widespread as one should expect (for example, Skiena QQ has 
observed that suffix trees are the data structure with the highest need for better 
implementations). The following pragmatic considerations make them appear 
less attractive: 

— The linear-time constructions by Weiner, McCreight, and Ukkonen are quite
intricate to implement. (A separate review of these methods reveals
relationships among them that are much closer than one would think.)

— Although asymptotically optimal, their poor locality of memory reference
causes a significant loss of efficiency on cached processor architectures.

— Although asymptotically linear, suffix trees have a reputation of being greedy 
for space. For example, the efficient representation of McCreight requires
28 bytes per input character in the worst case. 

* Partially supported by DFG-grant Ku 1257/1-1. 



J.S. Vitter and C.D. Zaroliagis (Eds.): WAE’99, LNCS 1668, pp. 30-^3 1999- 
— Due to these facts, for many applications, the construction of a suffix tree
does not amortize. For example, if a text is to be searched only for a very
small number of patterns, then it is usually better to use a fast and simple
online method, such as the Boyer-Moore-Horspool algorithm, to search
the complete text anew for each pattern.
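
The online alternative mentioned in the last item can be sketched as follows. This is a minimal illustration of the Boyer-Moore-Horspool idea in Python, not the bmh program used in the experiments; the function name horspool_search is hypothetical.

```python
def horspool_search(text, pattern):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: for every character of the pattern except the last one,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the entry of the text character aligned with the last
        # pattern position; characters not in the table allow a full shift.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

Every search scans the text anew, which is why this approach only pays off when the number of patterns is very small.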

However, these concerns are alleviated by the following recent developments: 

— Giegerich and Kurtz advocate the use of a write-only, top-down construction,
referred to here as the wotd-algorithm. Although its running time
is O(n log n) on average and even O(n^2) in the worst case (for a text of
length n), it is competitive in practice, due to its simplicity and good locality
of memory reference.

— Kurtz developed a space-efficient representation that allows suffix trees to be
computed in linear time in 46% less space than previous methods. As
a consequence, suffix trees for large texts, e.g. complete genomes, have
proved to be manageable.

— The question of amortizing the cost of suffix tree construction is almost
eliminated by incrementally constructing the tree as demanded by its queries.
This possibility was already hinted at in earlier work, where the wotd-algorithm was
called “lazy tree” for this reason.



When implementing the wotd-algorithm in a lazy functional programming language,
the suffix tree automatically becomes a lazy data structure, but of course,
the general overhead of using a lazy language is incurred. In the present paper, we
explicate how a lazy and an eager version of the wotd-algorithm can efficiently be
implemented in an imperative language. Our implementation technique avoids a
constant alphabet factor in the running time.^1 It is based on a new space efficient
suffix tree representation, which requires only 12n bytes of space in the worst
case. This is an improvement of 8n bytes over the most space efficient previous
representation. Experimental results show that our implementation
technique leads to programs that are superior to previous ones in
many situations. For example, when searching 0.1n patterns of length between
10 and 20 in a text of length n, the lazy wotd-algorithm (wotdlazy, for short)
is on average almost 35% faster and 30% more space efficient than a linked list
implementation of McCreight’s linear time suffix tree algorithm. wotdlazy is
almost 13% faster and 50% more space efficient than a hash table implementation
of McCreight’s linear time suffix tree algorithm, eight times faster and 10%
more space efficient than a program based on suffix arrays, and
99 times faster than the iterated application of the Boyer-Moore-Horspool
algorithm. The lazy wotd-algorithm makes suffix trees also applicable in contexts
where the expected number of queries to the text is small relative to the length
of the text, with an almost immeasurable overhead compared to its eager variant
wotdeager in the opposite case. Besides its usefulness for searching string patterns,
wotdlazy is interesting for other problems, such as exact set
matching, the substring problem for a database of patterns, the DNA contamination
problem, common substrings of more than two strings, circular string
linearization, or computation of the q-word distance of two strings. Documented
source code, test data, and complete results of our experiments are available at
http://www.techfak.uni-bielefeld.de/~kurtz/Software/wae99.tar.gz.

^1 Certain suffix array constructions and linear time suffix tree constructions
also do not have the alphabet factor in their running time. For other linear time suffix
tree constructions the alphabet factor can be avoided by employing
hashing techniques, however, at the cost of using considerably more space.

2 The wotd-Suffix Tree Construction

2.1 Terminology 

Let \(\Sigma\) be a finite ordered set of size \(k\), the alphabet. \(\Sigma^*\) is the set of all strings over
\(\Sigma\), and \(\varepsilon\) is the empty string. We use \(\Sigma^+\) to denote the set \(\Sigma^* \setminus \{\varepsilon\}\) of non-empty
strings. We assume that \(t\) is a string over \(\Sigma\) of length \(n \geq 1\) and that \(\$ \in \Sigma\) is a
character not occurring in \(t\). For any \(i \in [1, n+1]\), let \(s_i = t_i \ldots t_n\$\) denote the \(i\)th
non-empty suffix of \(t\$\). A \(\Sigma^+\)-tree \(T\) is a finite rooted tree with edge labels from
\(\Sigma^+\). For each \(a \in \Sigma\), every node \(u\) in \(T\) has at most one \(a\)-edge \(u \xrightarrow{av} w\) for some
string \(v\) and some node \(w\). An edge leading to a leaf is a leaf edge. Let \(u\) be a node
in \(T\). We denote \(u\) by \(\overline{w}\) if and only if \(w\) is the concatenation of the edge labels
on the path from the root to \(u\). \(\overline{\varepsilon}\) is the root. A string \(s\) occurs in \(T\) if and only if
\(T\) contains a node \(\overline{sv}\), for some string \(v\). The suffix tree for \(t\), denoted by \(ST(t)\),
is the \(\Sigma^+\)-tree \(T\) with the following properties: (i) each node is either a leaf or
a branching node, and (ii) a string \(w\) occurs in \(T\) if and only if \(w\) is a substring
of \(t\$\). There is a one-to-one correspondence between the non-empty suffixes of
\(t\$\) and the leaves of \(ST(t)\). For each leaf \(\overline{s_j}\) we define \(\ell(\overline{s_j}) = \{j\}\). For each
branching node \(\overline{u}\) we define \(\ell(\overline{u}) = \{j \mid \overline{u} \xrightarrow{v} \overline{uv} \text{ is an edge in } ST(t),\ j \in \ell(\overline{uv})\}\).
\(\ell(\overline{u})\) is the leaf set of \(\overline{u}\).

2.2 A Review of the wotd-Algorithm

The wotd-algorithm adheres to the recursive structure of a suffix tree. The idea
is that for each branching node \(\overline{u}\) the subtree below \(\overline{u}\) is determined by the
set of all suffixes of \(t\$\) that have \(u\) as a prefix. In other words, if we have the
set \(R(\overline{u}) := \{s \mid us \text{ is a suffix of } t\$\}\) of remaining suffixes available, we can
evaluate the node \(\overline{u}\). This works as follows: at first \(R(\overline{u})\) is divided into groups
according to the first character of each suffix. For any character \(c \in \Sigma\), let
\(group(\overline{u}, c) := \{w \in \Sigma^* \mid cw \in R(\overline{u})\}\) be the \(c\)-group of \(R(\overline{u})\). If for some
\(c \in \Sigma\), \(group(\overline{u}, c)\) contains only one string \(w\), then there is a leaf edge labeled
\(cw\) outgoing from \(\overline{u}\). If \(group(\overline{u}, c)\) contains at least two strings, then there is an
edge labeled \(cv\) leading to a branching node \(\overline{ucv}\), where \(v\) is the longest common
prefix (lcp, for short) of all strings in \(group(\overline{u}, c)\). The child \(\overline{ucv}\) can then be
evaluated from the set \(R(\overline{ucv}) = \{w \mid vw \in group(\overline{u}, c)\}\) of remaining suffixes.

The wotd-algorithm starts by evaluating the root from the set \(R(root)\) of all
suffixes of \(t\$\). All nodes of \(ST(t)\) can be evaluated recursively from the
corresponding set of remaining suffixes in a top-down manner.
Example. Consider the input string t = abab. The wotd-algorithm for t works
as follows: At first, the root is evaluated from the set R(root) of all non-empty
suffixes of the string t$, see the first five columns in Fig. 1. The algorithm
recognizes 3 groups of suffixes: the a-group, the b-group, and the $-group. The
a-group and the b-group each contain two suffixes, hence we obtain two unevaluated
branching nodes, which are reached by an a-edge and by a b-edge. The $-group
is singleton, so we obtain a leaf reached by an edge labeled $. To evaluate the
unevaluated branching node corresponding to the a-group, one first computes
the longest common prefix of the remaining suffixes of that group. This is b in
our case. So the a-edge from the root is labeled by ab, and the remaining suffixes
ab$ and $ are divided into groups according to their first character. Since this
is different, we obtain two singleton groups of suffixes, and thus two leaf edges
outgoing from ab. These leaf edges are labeled by ab$ and $. The unevaluated
branching node corresponding to the b-group is evaluated in a similar way, see
Fig. 1.

Fig. 1. The write-only top-down construction of ST(abab)
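
The recursive evaluation reviewed in Sect. 2.2 and traced in the example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the tree is returned as nested dictionaries (edge label to subtree, leaves mapped to their suffix number), which is not the table representation of Sect. 3, and the function name wotd is hypothetical.

```python
def wotd(text):
    """Suffix tree of text + '$' as nested dicts: edge label -> subtree.
    A leaf is represented by its 1-based suffix number."""
    t = text + "$"  # assumes '$' does not occur in text

    def evaluate(starts, depth):
        # starts: 0-based start positions of the suffixes below this node
        # depth:  length of the string spelled out on the path to this node
        groups = {}
        for s in starts:
            groups.setdefault(t[s + depth], []).append(s)
        node = {}
        for group in groups.values():
            first = group[0]
            if len(group) == 1:
                # singleton group: leaf edge labeled with the rest of the suffix
                node[t[first + depth:]] = first + 1
            else:
                # extend the common prefix until two suffixes disagree;
                # the sentinel guarantees that this happens inside t
                j = 1
                while len({t[s + depth + j] for s in group}) == 1:
                    j += 1
                label = t[first + depth:first + depth + j]
                node[label] = evaluate(group, depth + j)
        return node

    return evaluate(list(range(len(t))), 0)
```

For t = abab this yields the root edges ab, b and $, with leaf edges ab$ and $ below both branching nodes, as in the example.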



2.3 Properties of the wotd-Algorithm



The distinctive property of the wotd-algorithm is that the construction proceeds
top-down. Once a node has been constructed, it need not be revisited in the
construction of other parts of the tree (unlike in the linear-time constructions
mentioned above). As the order of subtree construction is independent otherwise, it
may be arranged in a demand-driven fashion, obtaining the lazy implementation
detailed in the next section.

The top-down construction has been mentioned several times in the literature,
but at first glance, its worst case running time of O(n^2) is
disappointing. However, the expected running time is O(n log_k n),
and previously reported experiments suggest that the wotd-algorithm is practically linear for
moderate size strings. This can be explained by the good locality behavior: the
wotd-algorithm has optimal locality on the tree data structure. In principle,
no more than a “current path” of the tree needs to be in memory. With respect
to text access, the wotd-algorithm also behaves very well: For each subtree, only
the corresponding remaining suffixes are accessed. At a certain tree level, the
number of suffixes considered will be smaller than the number of available cache
entries. As these suffixes are read sequentially, practically no further cache misses
will occur. This point is reached earlier when the branching degree of the tree
nodes is higher, since the suffixes split up more quickly. Hence, the locality of
the wotd-algorithm improves for larger values of k.

Aside from the linear constructions already mentioned, there are O(n log n)
time suffix tree constructions which are based on Hopcroft’s partitioning
technique. While these constructions are faster in terms of worst-case
analysis, the subtrees are not constructed independently. Hence they do not share
the locality of the wotd-algorithm, nor do they allow for a lazy implementation.



3 Implementation Techniques 

This section describes how the wotd-algorithm can be implemented in an eager
language. The “simulation” of lazy evaluation in an eager language is not a very
common approach. Unevaluated parts of the data structure have to be represented
explicitly, and the traversal of the suffix tree becomes more complicated
because it has to be merged with the construction of the tree. We will show,
however, that by a careful consideration of efficiency matters, one can end up
with a program which is not only more efficient and flexible in special applications,
but which performs comparably to the best existing implementations of
index-based exact string matching algorithms in general.

We first describe the data structure that stores the suffix tree, and then we 
show how to implement the lazy and eager evaluation, including the additional 
data structures. 



3.1 The Suffix Tree Data Structure 

To implement a suffix tree, we basically have to represent three different items:
nodes, edges and edge labels. To describe our representation, we define a total
order \(\prec\) on the children of a branching node: Let \(\overline{u}\) and \(\overline{v}\) be two different nodes
in \(ST(t)\) which are children of the same branching node. Then \(\overline{u} \prec \overline{v}\) if and only
if \(\min \ell(\overline{u}) < \min \ell(\overline{v})\). Note that leaf sets are never empty and \(\ell(\overline{u}) \cap \ell(\overline{v}) = \emptyset\).
Hence \(\prec\) is well defined.

Let us first consider how to represent the edge labels. Since an edge label \(v\) is a
substring of \(t\$\), it can be represented by a pair of pointers \((i, j)\) into \(t' = t\$\), such
that \(v = t'_i \ldots t'_j\). In case the edge is a leaf edge, we have \(j = n + 1\), i.e., the right
pointer \(j\) is redundant. In case the edge leads to a branching node, it also suffices
to only store a left pointer, if we choose it appropriately: Let \(\overline{u} \xrightarrow{v} \overline{uv}\) be an edge
in \(ST(t)\). We define \(lp(\overline{uv}) := \min \ell(\overline{uv}) + |u|\), the left pointer of \(\overline{uv}\). Now suppose
that \(\overline{uv}\) is a branching node and \(i = lp(\overline{uv})\). Assume furthermore that \(\overline{uvw}\) is the
smallest child of \(\overline{uv}\) w.r.t. the relation \(\prec\). Hence we have \(\min \ell(\overline{uv}) = \min \ell(\overline{uvw})\),
and thus \(lp(\overline{uvw}) = \min \ell(\overline{uvw}) + |uv| = \min \ell(\overline{uv}) + |u| + |v| = lp(\overline{uv}) + |v|\). Now
let \(r = lp(\overline{uvw})\). Then \(v = t_i \ldots t_{i+|v|-1} = t_i \ldots t_{lp(\overline{uv})+|v|-1} = t_i \ldots t_{lp(\overline{uvw})-1} = t_i \ldots t_{r-1}\).
In other words, to retrieve edge labels in constant time, it suffices to
store the left pointer for each node (including the leaves). For each branching
node \(\overline{u}\) we additionally need constant time access to the child of \(\overline{u}\) with the
smallest left pointer. This access is provided by storing a reference \(firstchild(\overline{u})\)
to the first child of \(\overline{u}\) w.r.t. \(\prec\). The lp- and firstchild-values are stored in a
single integer table T. The values for children of the same node are stored in
consecutive positions ordered w.r.t. \(\prec\). Thus, only the edges to the first child
are stored explicitly. The edges to all other children are implicit. They can be
retrieved by scanning consecutive positions in table T.

Any node u is referenced by the index in T where lp(u) is stored. To decode
the tree representation, we need two extra bits: A leaf bit marks an entry in T
corresponding to a leaf, and a rightmost child bit marks an entry corresponding
to a node which does not have a right brother w.r.t. ≺. Fig. 2 shows a table T
representing ST(abab).



position:   1    2    3    4    5     6      7     8      9
T:          1    6    2    8   [5]   [3]    [5]   [3]    [5]
node:       ab   ab   b    b    $    abab$  ab$   bab$   b$

Fig. 2. A table T representing ST(abab) (see Fig. 1). The input string as well as
T is indexed from 1. The entries in T corresponding to leaves are shown in grey
boxes in the original figure (in square brackets here). The first value for a
branching node u is lp(u), the second is firstchild(u).
The leaves $, ab$, and b$ are rightmost children
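
To make the decoding of table T concrete, the following Python sketch walks the table for ST(abab) with the values 1 6 2 8 5 3 5 3 5. It is a simplified illustration: the leaf bit and the rightmost child bit are kept in separate sets instead of spare bits of the integers, the root's children are assumed to start at position 1, and all names are hypothetical.

```python
T_PRIME = "abab$"                      # t' = t$; all positions below are 1-based
T = [None, 1, 6, 2, 8, 5, 3, 5, 3, 5]  # T[0] is unused so that T is indexed from 1
LEAF = {5, 6, 7, 8, 9}                 # positions whose entry carries the leaf bit
RIGHTMOST = {5, 7, 9}                  # positions carrying the rightmost child bit
ROOT_FIRST_CHILD = 1                   # assumption: children of the root start here

def children(first):
    """Yield the table positions of all siblings, starting at position first."""
    idx = first
    while True:
        yield idx
        if idx in RIGHTMOST:
            return
        idx += 1 if idx in LEAF else 2  # a leaf uses one integer, a branching node two

def edge_label(idx):
    """Label of the edge leading to the node stored at position idx."""
    lp = T[idx]
    if idx in LEAF:
        return T_PRIME[lp - 1:]         # leaf edge: the right pointer is n + 1
    first_child = T[idx + 1]
    return T_PRIME[lp - 1:T[first_child] - 1]  # ends before lp of the first child

def occurs(pattern):
    """Return True if pattern is a substring of t'."""
    first = ROOT_FIRST_CHILD
    while pattern:
        for idx in children(first):
            label = edge_label(idx)
            if label[0] == pattern[0]:
                break
        else:
            return False                # no edge starts with the next character
        k = min(len(label), len(pattern))
        if label[:k] != pattern[:k]:
            return False
        pattern = pattern[k:]
        if pattern:
            if idx in LEAF:
                return False            # pattern continues past a leaf
            first = T[idx + 1]
    return True
```

Only the first child of each branching node is referenced explicitly; its brothers are found by scanning forward until the rightmost child bit is met.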



3.2 The Evaluation Process 

The wotd-algorithm is best viewed as a process evaluating the nodes of the
suffix tree, starting at the root and recursively proceeding downwards into the
subtrees.

We first describe how an unevaluated node u of ST(t) is stored. For the
evaluation of u, we need access to the set R(u) of remaining suffixes. Therefore
we employ a global array suffixes which contains pointers to suffixes of t$. For
each unevaluated node u, there is an interval in suffixes which stores pointers to
all the starting positions in t$ of suffixes in R(u), ordered by descending suffix
length from left to right. R(u) is then represented by the two boundaries left(u)
and right(u) of the corresponding interval in suffixes. The boundaries are stored
in the two integers reserved in table T for the branching node u. To distinguish
evaluated and unevaluated nodes, we use a third bit, the unevaluated bit.

Now we can describe how u is evaluated: The edges outgoing from u are
obtained by a simple counting sort using the first character of each suffix
stored in the interval [left(u), right(u)] of the array suffixes as the key in the
counting phase. Each character c with count greater than zero corresponds to
a c-edge outgoing from u. Moreover, the suffixes in the c-group determine the
subtree below that edge. The pointers to the suffixes of the c-group are stored in
a subinterval, in descending order of their length. To obtain the complete label
of the c-edge, the lcp of all suffixes in the c-group is computed. If the c-group
contains just one suffix s, then the lcp is s itself. If the c-group contains more
than one suffix, then a simple loop tests for equality of the characters \(t_{i+j}\)
for j = 1, 2, ... and for all start positions i of the suffixes in the c-group. As soon
as an inequality is detected, the loop stops and j is the length of the lcp of the
c-group.

The children of u are stored in table T, one for each non-empty group. A
group with count one corresponds to a subinterval of width one. It leads to a
leaf, say s, for which we store lp(s) in the next available position of table T.
lp(s) is given by the left boundary of the group. A group of size larger than
one leads to an unevaluated branching node, say v, for which we store left(v)
and right(v) in the next two available positions of table T. In this way, all
nodes with the same father u are stored in consecutive positions. Moreover,
since the suffixes of each interval are in descending order of their length, the
children are ordered w.r.t. the relation ≺. The values left(v) and right(v) are
easily obtained from the counts in the counting sort phase, and setting the leaf
bit and the rightmost child bit is straightforward. To prepare for the (possible)
evaluation of v, the values in the interval [left(v), right(v)] of the array suffixes
are incremented by the length of the corresponding lcp. Finally, after all successor
nodes of u are created, the values of left(u) and right(u) in T are replaced by
the integers lp(u) := suffixes[left(u)] and firstchild(u), and the unevaluated bit
for u is deleted.
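
The grouping step of the evaluation can be sketched as follows. This Python fragment covers only the counting sort over an interval of the array suffixes, the lcp loop, and the increment that prepares the children; writing the children into table T is left out. Positions are 0-based, groups are placed in order of first appearance so that they come out ordered by their longest suffix, and the function name evaluate_interval is hypothetical.

```python
def evaluate_interval(t, suffixes, left, right):
    """Split suffixes[left..right] (inclusive) into c-groups by a stable
    counting sort. Returns (first_char, group_left, group_right, lcp_len)
    tuples and advances the pointers of every non-singleton group."""
    interval = suffixes[left:right + 1]   # buffer of the width of the interval
    # Counting phase: the key is the first character of each remaining suffix.
    counts = {}
    for s in interval:
        counts[t[s]] = counts.get(t[s], 0) + 1
    # Assign subintervals in order of first appearance of each character.
    offsets, nxt = {}, left
    for s in interval:
        if t[s] not in offsets:
            offsets[t[s]] = nxt
            nxt += counts[t[s]]
    bounds = {c: (offsets[c], offsets[c] + counts[c] - 1) for c in offsets}
    # Distribution phase: stable, so each group keeps descending suffix length.
    for s in interval:
        suffixes[offsets[t[s]]] = s
        offsets[t[s]] += 1
    children = []
    for c, (lo, hi) in bounds.items():
        group = suffixes[lo:hi + 1]
        if lo == hi:
            j = len(t) - group[0]         # a single suffix is its own lcp
        else:
            j = 1
            while all(t[s + j] == t[group[0] + j] for s in group):
                j += 1
            for k in range(lo, hi + 1):
                suffixes[k] += j          # remaining suffixes of the child
        children.append((c, lo, hi, j))
    return children
```

For t$ = abab$ and the initial interval covering all five suffixes, the a-group ends up with lcp length 2 and the b-group with lcp length 1, matching the example in Sect. 2.2.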

The nodes of the suffix tree can be evaluated in an arbitrary order respect- 
ing the father/child relation. Two strategies are relevant in practice: The eager 
strategy evaluates nodes in a depth-first and left-to-right traversal, as long as 
there are unevaluated nodes remaining. The program implementing this strat- 
egy is called wotdeager in the sequel. The lazy strategy evaluates a node not 
before the corresponding subtree is traversed for the first time, for example by 
a procedure searching for patterns in the suffix tree. The program implementing 
this strategy is called wotdlazy in the sequel. 
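
The difference between the two strategies can be illustrated with a small Python sketch in which a node is evaluated only when a search first reaches it. It uses linked node objects rather than table T and the array suffixes, so it shows the control flow only; the class and method names are hypothetical.

```python
class Node:
    def __init__(self, starts, depth):
        self.starts = starts    # start positions of the suffixes below this node
        self.depth = depth      # length of the string on the path to this node
        self.children = None    # None plays the role of the unevaluated bit

class LazySuffixTree:
    def __init__(self, text):
        self.t = text + "$"     # assumes '$' does not occur in a non-empty text
        self.root = Node(list(range(len(self.t))), 0)
        self.evaluated = 0      # number of branching nodes evaluated so far

    def _evaluate(self, node):
        t, groups = self.t, {}
        for s in node.starts:
            groups.setdefault(t[s + node.depth], []).append(s)
        node.children = {}
        for c, group in groups.items():
            first = group[0] + node.depth
            if len(group) == 1:
                length = len(t) - first
            else:
                length = 1
                while all(t[s + node.depth + length] == t[first + length]
                          for s in group):
                    length += 1
            child = Node(group, node.depth + length)
            node.children[c] = (first, length, child)
        self.evaluated += 1

    def contains(self, pattern):
        """Lazy strategy: evaluation is merged with the traversal."""
        node = self.root
        while pattern:
            if len(node.starts) == 1:
                return False            # reached a leaf, nothing below it
            if node.children is None:
                self._evaluate(node)
            if pattern[0] not in node.children:
                return False
            start, length, child = node.children[pattern[0]]
            k = min(length, len(pattern))
            if self.t[start:start + k] != pattern[:k]:
                return False
            pattern, node = pattern[k:], child
        return True

    def evaluate_all(self):
        """Eager strategy: evaluate every remaining branching node."""
        stack = [self.root]
        while stack:
            node = stack.pop()
            if len(node.starts) > 1:
                if node.children is None:
                    self._evaluate(node)
                stack.extend(child for _, _, child in node.children.values())
```

Searching for ba in abab evaluates only the root and the node b; the node ab stays unevaluated until a later search or an eager pass reaches it.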

3.3 Space Requirement 

The suffix tree representation as described in Sect. 3.1 requires 2q + n integers,
where q is the number of non-root branching nodes. Since q = n - 1 in the worst
case, this is an improvement of 2n integers over the best previous representation.
However, one has to be careful when comparing the 2q + n
representation of Sect. 3.1 with previous results. The 2q + n representation is
tailored for the wotd-algorithm and requires extra working space of 2.5n integers
in the worst case.^2 The array suffixes contains n integers, and the counting sort
requires a buffer of the width of the interval which is to be sorted. In the worst
case, the width of this interval is n - 1. Moreover, wotdeager needs a stack of
size up to n/2, to hold references to unevaluated nodes.

Careful memory management, however, makes it possible to save space in practice.
Note that during eager evaluation, the array suffixes is processed from left to
right, i.e., it contains a completely processed prefix. Simultaneously, the space
requirement for the suffix tree grows. By reclaiming the completely processed
prefix of the array suffixes for the table T, the extra working space required by
wotdeager is only little more than one byte per input character, see Table 1. For
wotdlazy, it is not possible to reclaim unused space of the array suffixes, since
this is processed in an arbitrary order. As a consequence, wotdlazy needs more
working space.
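
The worst-case figures can be checked with a little arithmetic. The Python sketch below assumes 4-byte integers, as on the test machine described in Sect. 4; the function names are hypothetical.

```python
INT_BYTES = 4  # assumption: each integer occupies 4 bytes

def tree_bytes(n, q):
    """Bytes needed for the 2q + n integer representation."""
    return INT_BYTES * (2 * q + n)

def worst_case_working_bytes(n):
    """n integers for the array suffixes, n - 1 for the sort buffer and
    n/2 for the stack of wotdeager, i.e. about 2.5n integers."""
    return INT_BYTES * (n + (n - 1) + n // 2)
```

With q = n - 1 the tree needs 12n - 8 bytes, i.e. about 12n, and the working space comes to about 10n bytes, which agrees with the bounds quoted for the worst case.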

4 Experimental Results 

For our experiments, we collected a set of 11 files of different sizes and types.
We restricted ourselves to 7-bit ASCII files, since the suffix tree application
we consider (searching for patterns) does not make sense for binary files. Our
collection consists of the following files: We used five files from the Calgary
Corpus: book1, book2, paper1, bib, progl. The former three contain English text,
and the latter two formal text (bibliographic items and Lisp program). We added
two files (containing English text) from the Canterbury Corpus:^3 lcet10 and
alice29. We extracted a section of 500,000 residues from the PIR protein sequence
database, denoted by pir500. Finally, we added three DNA sequences: ecoli500
(first 500,000 bases of the ecoli genome), ychrIII (chromosome III of the yeast
genome), and vaccg (complete genome of the vaccinia virus).

All programs we consider are written in C. We used the egcs compiler, release
1.1.2, with optimizing option -O3. The programs were run on a computer with a
400 MHz AMD K6-II processor, 128 MB RAM, under Linux. On this computer
each integer and each pointer occupies 4 bytes.

In a first experiment we ran three different programs constructing suffix trees:
wotdeager, mccl, and mcch. The latter two implement McCreight’s suffix tree
construction. mccl computes the improved linked list representation, and
mcch computes the improved hash table representation of the suffix tree, as
described elsewhere. Table 1 shows the running times and the space requirements.
We normalized w.r.t. the length of the files. That is, we show the relative time
(in seconds) to process 10^6 characters (i.e., rtime = (10^6 · time)/n), and the
relative space requirement in bytes per input character. For wotdeager we show
the space requirement for the suffix tree representation (stspace), as well as the
total space requirement including the working space. mccl and mcch only require
constant extra working space. The last row of Table 1 shows the total length of
the files, and the averages of the values of the corresponding columns. In each
row a grey box marks the smallest relative time and the smallest relative space
requirement, respectively.

^2 Moreover, the wotd-algorithm does not run in linear worst case time, in contrast to
e.g. McCreight’s algorithm, which can be used to construct the 5n representations
in constant working space. It is not clear to us whether it is possible to
construct the 2q + n representation of this paper within constant working space, or
in linear time. In particular, it is not possible to construct it with McCreight’s
or with Ukkonen’s algorithm.

^3 Both corpora can be obtained from http://corpus.canterbury.ac.nz









                          wotdeager                 mccl             mcch
file      n         k     rtime  stspace  space     rtime  space     rtime  space
book1      768771   82    2.82   8.01     9.09      3.55   10.00     2.55   14.90
book2      610856   96    2.60   8.25     9.17      2.90   10.00     2.31   14.53
lcet10     426754   84    2.48   8.25     9.24      2.79   10.00     2.30   14.53
alice29    152089   74    1.97   8.25     9.43      2.43   10.01     2.17   14.54
paper1      53161   95    1.69   8.37     9.50      1.88   10.02     1.88   14.54
bib        111261   81    1.98   8.30     9.17      2.07    9.61     1.89
...
          3701614         2.66   8.53     9.71      2.65   10.77     2.34   15.40

Table 1. Time and space requirement for different programs constructing suffix
trees



All three programs have similar running times. wotdeager and mcch show a
more stable running time than mccl. This may be explained by the fact that the
running time of wotdeager and mcch is independent of the alphabet size. A
thorough explanation of the behavior of mccl and mcch is given elsewhere. While
wotdeager does not give us a running time advantage, it is more space efficient
than the other programs, using 1.06 and 5.69 bytes per input character less than
mccl and mcch, respectively. Note that the additional working space required
for wotdeager is on average only 1.18 bytes per input character.

In a second experiment we studied the behavior of different programs search- 
ing for many exact patterns in an input string, a scenario which occurs for ex- 
ample in genome-scale sequencing projects, see Q Sect. 7.15]. For the programs 
of the previous experiment, and for wotdlazy, we implemented search functions. 
wotdeager and mccl require O(km) time to search for a pattern string of length
m. mcch requires O(m) time. Since the pattern search for wotdlazy is merged
with the evaluation of suffix tree nodes, we cannot give a general statement
about the running time of the search. We also considered suffix arrays, using
the original program code developed by Manber and Myers ^3 page 946]. The 
suffix array program, referred to by mamy, constructs a suffix array in O(n log n)
time. Searching is performed in O(m + log n) time. The suffix array requires 5n
bytes of space. For the construction, additionally 4n bytes of working space are
required. Finally, we also considered the iterated application of an on-line string
searching algorithm, our own implementation of the Boyer-Moore-Horspool
algorithm, referred to by bmh. The algorithm takes O(n + m) expected time
per search, and uses O(m) working space.
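
For comparison, searching in a suffix array can be sketched in Python as follows. This is a deliberately simplified stand-in: the array is built by naive sorting and the search is a plain binary search taking O(m log n) time, not the O(n log n) construction and O(m + log n) search of the mamy program. The function names are hypothetical.

```python
def build_suffix_array(t):
    """Start positions of all suffixes of t in lexicographic order
    (naive construction by sorting, for illustration only)."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def sa_contains(t, sa, pattern):
    """Binary search for the first suffix whose prefix of length
    len(pattern) is not smaller than pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and t.startswith(pattern, sa[lo])
```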

We generated patterns according to the following strategy: For each input
string t of length n we randomly sampled pn substrings \(s_1, s_2, \ldots, s_{pn}\) of different
lengths from t. The proportionality factor p was between 0.0001 and 1. The
lengths were evenly distributed over the interval [10, 20]. For \(i \in [1, pn]\), the
programs were called to search for pattern \(p_i\), where \(p_i = s_i\) if i is even, and \(p_i\) is
the reverse of \(s_i\) otherwise. Reversing a string \(s_i\) simulates the case that a pattern
search is often unsuccessful. Table 2 shows the relative running times for p = 0.1.
For wotdlazy we show the space requirement for the suffix tree after all pn pattern
searches have been performed (stspace), and the total space requirement. For
mamy, bmh, and the other three programs the space requirement is independent
of p. Thus for the space requirement of wotdeager, mccl, and mcch see Table 1.
The space requirement of bmh is marginal, so it is omitted in Table 2.
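
The pattern generation strategy can be sketched in Python as follows. The text does not say how pn is rounded or how start positions are drawn, so rounding to the nearest integer and uniform start positions are assumptions here, as is the function name generate_patterns.

```python
import random

def generate_patterns(t, p, rng):
    """Sample round(p * n) substrings of t with lengths drawn uniformly
    from [10, 20]; samples with odd 1-based index are reversed."""
    n = len(t)  # assumes n >= 20
    patterns = []
    for i in range(1, round(p * n) + 1):
        length = rng.randint(10, 20)
        start = rng.randrange(0, n - length + 1)
        s = t[start:start + length]
        patterns.append(s if i % 2 == 0 else s[::-1])
    return patterns
```

A call such as generate_patterns(t, 0.1, random.Random(0)) yields the 0.1n patterns of one run; fixing the seed keeps the pattern set reproducible across programs.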
Table 2. Time and space requirement for searching 0.1n exact patterns



Except for the DNA sequences, wotdlazy is the fastest and most space efficient
program for p = 0.1. This is due to the fact that the pattern searches only
evaluate a part of the suffix tree. Comparing the stspace columns of Tables 1 and
2 we can estimate that for p = 0.1 about 40% of the suffix tree is evaluated. We
can also deduce that wotdeager performs pattern searches faster than mcch, and
much faster than mccl. This can be explained as follows: searching for patterns
means that for each branching node the list of successors is traversed, to find
a particular edge. However, in our suffix tree representation, the successors are
found in consecutive positions of table T. This means a small number of cache
misses, and hence the good performance. It is remarkable that wotdlazy is more
space efficient and eight times faster than mamy. Of course, the space advantage
of wotdlazy is lost with a larger number of patterns. In particular, for p > 0.3
mamy is the most space efficient program. Figs. 3 and 4 give a general overview
of how p influences the running times. Fig. 3 shows the average relative running
time for all programs and different choices of p for p < 0.005. Fig. 4 shows the
average relative running time for all programs except bmh for all values of p. We
observe that wotdlazy is the fastest program for p < 0.3, and wotdeager is the
fastest program for p > 0.4. bmh is faster than wotdlazy only for p < 0.0003.
Thus the index construction performed by wotdlazy already amortizes for a very
small number of pattern searches.

We also performed some tests on two larger files (English text) of length 3 MB
and 5.6 MB, and we observed the following: 

— The relative running time for wotdeager slightly increases, i.e. the superlin- 
earity in the complexity becomes visible. As a consequence, mcch becomes 
faster than wotdeager (but still uses 50% more space). 

— With p approaching 1, the slower suffix tree construction of wotdeager and 
wotdlazy is compensated for by a faster pattern search procedure, so that 
there is a running time advantage over mcch. 

5 Conclusion 

We have developed efficient implementations of the write-only top-down suffix 
tree construction. These construct a representation of the suffix tree which requires
only 12n bytes of space in the worst case, plus 10n bytes of working space.
The space requirement in practice is only 9.71n bytes on average for a collection 
of files of different type. The time and space overhead of the lazy implementation 
is very small. Our experiments show that for searching many exact patterns in 
an input string, the lazy algorithm is the most space and time efficient algorithm 
for a wide range of input values. 
Abstract. Algorithms for the problem of list ranking are empirically
studied with respect to the Explicit Multi-Threaded (XMT) platform
for instruction-level parallelism (ILP). The main goal of this study is
to understand the differences between XMT and more traditional parallel
computing implementation platforms/models as they pertain to the
well studied list ranking problem. The main two findings are: (i) Good
speedups for much smaller inputs are possible. (ii) In part, this finding
is based on competitive performance by a new variant of a 1984 algorithm,
called the No-Cut algorithm. The paper incorporates analytic
(non-asymptotic) performance analysis into experimental performance
analysis for relatively small inputs. This provides an interesting example
where experimental research and theoretical analysis complement one
another.^1

Explicit Multi-Threading (XMT) is a fine-grained computation frame- 
work introduced in our SPAA’98 paper. Building on some key ideas of 
parallel computing, XMT covers the spectrum from algorithms through 
architecture to implementation; the main implementation related inno- 
vation in XMT was through the incorporation of low-overhead hardware 
and software mechanisms (for more effective fine-grained parallelism). 
The reader is referred to that paper for detail on these mechanisms. The 
XMT platform aims at faster single-task completion time by way of ILP. 



* Partially supported by NSF grant CCR-9416890 at U. Maryland. 

^1 This paper suggests a possible new utility for the developing field of algorithm
engineering: to contribute to the interplay between two research questions: how to build
new computing systems? and how to design algorithms?

More concretely, reducing the completion time of a single general-purpose computing 
task by way of parallelism has been one of the main challenges for the design of new 
computing systems over the last few decades. ILP has been a realm where success has 
already been demonstrated. A motivation for originally proposing XMT has been to 
provide a platform for the new utility in the context of the ILP realm. 

The referees asked us a few general questions: (i) Does it make sense to compare 
speed-ups for XMT, which for now is nothing but a simulator, with speed-ups for a 
real machine? Noting that such comparisons are standard in computer architecture 
research, we agree that caution is required, (ii) Why is the XMT model not more 
succinctly presented? The XMT model is based on an assembly language and an 
execution model since this is the way it will be implemented. This is similar to
the underlying assumptions in measuring ILP in Section 4.7 in [HP96]. (iii) How is
bandwidth incorporated in the XMT model? Bandwidth does not explicitly appear;
within a chip it is generally not a bottleneck, see [DL99].



J.S. Vitter and C.D. Zaroliagis (Eds.): WAE’99, LNCS 1668, pp. 43-^3 1999- 
© Springer-Verlag Berlin Heidelberg 1999
1 Introduction



This work considers practical parallel list-ranking algorithms. The model for
which programs are written is a single-program multiple-data (SPMD) “bridging
model”. This model is designated as a programmer’s model for a fine-grained
computation framework called Explicit Multi-Threading (XMT), which was
introduced in our SPAA’98 paper. The XMT framework covers the spectrum from
algorithms through architecture to implementation; it is meant to provide a
platform for faster single-task completion time by way of instruction-level
parallelism (ILP). The performance of XMT programs is evaluated as follows: the
performance of a matching optimized XMT assembly code is measured within
an XMT execution model. (We use in the current paper the so-called Spawn-MT
programming model, the easier to implement of the two programming
models presented for XMT.) The XMT approach deviates from the standard
PRAM approach by incorporating reduced synchrony and departing from the
lock-step structure in its so-called asynchronous mode. Our envisioned platform
uses an extension to a standard serial instruction set. This extension efficiently
implements PRAM-style algorithms using explicit multi-threaded ILP, which
allows considerably more fine-grained parallelism than the previously studied
parallel computing implementation platforms/models.

The list ranking problem was the first problem considered as we examined 
and refined many of the concepts in the XMT framework. The problem arises 
in parallel algorithms on lists, trees and graphs and is considered a fundamental 
problem in the theory of parallel algorithms. Experimental results are presented. 



Empirical study of parallel list ranking algorithms Implementation of parallel
algorithms for list ranking has been considered by many, including [HR96];
Section 7 gives more detail. The three main observations in this paper are as
follows: (i) Good speed-ups relative to the corresponding serial algorithm are
possible for smaller input sizes than before. We are not aware of previous
results for this problem which are competitive with the counterpart serial
algorithm for input size not exceeding 1000; often applicable input sizes have
been much larger. By efficiently supporting lighter threads, the XMT
envisioned implementation platform made this possible. List ranking is a
routine which is employed in many large combinatorial applications; we proceed to
illustrate the importance of achieving good speed-ups for smaller inputs by
comparing two hypothetical parallel computer systems, called A and B. The only
difference is that System B allows speed-ups for smaller inputs than System A.
Assumptions concerning System A (resp. B): in 95% (resp. 99%) of the time in
which the list ranking routine is used, a speed-up relative to the serial
algorithm is achieved; the speed-up is by a factor of 100. The improvement from
System A to B is significant! If a fraction $F$ of a computation can be enhanced
by a (speed-up) factor of $S$, Amdahl’s Law implies an overall speed-up of
$\frac{1}{(1-F) + F/S}$. So, the overall speed-up for System A is
$\frac{1}{0.05 + 0.95/100} = 1/0.0595 < 17$ and for System B it is $> 50$;
nearly 3 times the speed-up for System A! (ii) Choosing a fastest list
ranking algorithm is more involved than implied by the literature. A challenge for
our experimental work has been to compare some new insights we have with the
known experimental work on list ranking. While our results affirm that the RM
(Reid-Miller) algorithm is faster for large inputs, for smaller inputs a new
variant of an earlier algorithm, called the No-Cut algorithm, offers competitive
performance. (iii) Analytic (non-asymptotic) performance analysis is possible
for smaller inputs.

Due to space limitations, we must refer interested readers to [DV99].
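
The System A versus System B arithmetic in observation (i) can be checked with a few lines of Python. This is only an illustrative sketch; the function name `amdahl_speedup` is an assumption introduced here and does not appear in the paper.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speed-up when `fraction` of the time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# System A: speed-up of 100 in 95% of the time; System B: in 99% of the time.
system_a = amdahl_speedup(0.95, 100)
system_b = amdahl_speedup(0.99, 100)
ratio = system_b / system_a

print(f"A: {system_a:.2f}, B: {system_b:.2f}, B/A: {ratio:.2f}")
```

Under these assumptions System A stays below 17 and System B exceeds 50, so the ratio is close to 3.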

2 List-Ranking Algorithms 

The input is an array of N elements, each having a pointer to its successor 
in a linked list (the successor can be anywhere in the array). The pointer is 
labeled with the distance it represents. The list ranking problem is to find for 
each element in the list its total (weighted) distance from the end of the list. 

To evaluate our list-ranking algorithms, we implemented a serial list-ranking 
algorithm, with serial assembly code, and several parallel list-ranking algorithms 
with our new explicitly parallel assembly code. For each of the codes, we aimed at 
figuring out the best ILP performance, taking into account constant factors and 
not only asymptotic behavior. The serial algorithm is followed below by three 
parallel algorithms: No-Cut, Cut-6, and Wyllie’s algorithm. See Section 7.1 for
RM’s “random subset” algorithm.

Serial Algorithm The serial algorithm for the problem consists of first finding 
the “head” of the list; the head is the only element of the list that is not the 
successor of any other element in the list. Second (“forward loop”), the list is 
traversed to find the ranking of the “head” , determining for each element by how 
much closer it is to the “tail” of the list than the head. Finally (“final ranking”), 
the rankings of all elements in the list are derived.
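
The three serial steps can be sketched in Python as follows. The array representation is an assumption made for illustration (`succ[i]` is the index of the successor of element `i`, with -1 marking the last element, and `weight[i]` is the distance carried by that pointer); it is not the serial assembly code whose counts are reported later.

```python
def serial_list_rank(succ, weight):
    """Return, for each element, its weighted distance from the end of the list."""
    n = len(succ)

    # Step 1: the head is the only element that is nobody's successor.
    is_successor = [False] * n
    for s in succ:
        if s != -1:
            is_successor[s] = True
    head = is_successor.index(False)

    # Step 2 ("forward loop"): record how far each element is from the head.
    from_head = [0] * n
    cur, dist = head, 0
    while cur != -1:
        from_head[cur] = dist
        if succ[cur] != -1:          # the last element has no outgoing pointer
            dist += weight[cur]
        cur = succ[cur]

    # Step 3 ("final ranking"): dist now holds the rank of the head.
    return [dist - d for d in from_head]


# The list 3 -> 0 -> 2 -> 1 with pointer weights 5, 1, 1.
ranks = serial_list_rank([2, -1, 1, 0], [1, 0, 1, 5])
```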

Parallel Randomized Coin-Tossing Algorithm (No-Cut) The main new parallel
algorithm we used is randomized and works in iterations (see the accompanying
figure). Assuming an array with precomputed results of a “randomly tossed coin”
(0 or 1) is provided, an iteration first assigns such a 0 or 1 to each element
of the list. Every element which is assigned 1 and whose successor is assigned 0
is “selected”; other elements are not selected. Note that: (i) on average a
quarter of the elements are selected, and (ii) it is impossible to select two
successive elements. So far this is the same basic idea as in earlier work. The
new part is: for each selected element, “pointer-jump over” its successor
elements, stopping once another selected element is reached. This results in a
new list which contains only the selected elements. (In the next section the
length of the “chain” between selected elements is observed to be relatively
short, implying that the performance of this “No-Cut” algorithm is attractive.)
The selected element which pointer-jumped over a non-selected element is that
element’s “ranking parent”. All selected elements are included in a smaller
array. The compaction into the smaller array is achieved using a new
multiple-PS instruction. The smaller array is the input for the next
iteration. These “forward iterations” will result in finding the rank of the
head of the list. All the elements which have been jumped over (i.e., skipped)
during each forward iteration are compacted into another smaller array, noting
for each its “ranking parent”. Playing the iterations “backwards” will extend
the ranking from ranking parents to all elements of the list. Our evaluations
show that for some input sizes this No-Cut algorithm requires less work (but may
need more time) than other versions and algorithms.
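
The forward and backward iterations can be simulated serially with the Python sketch below. It is a simplified illustration under stated assumptions, not the XMT assembly code: the head is always treated as selected, an element without a successor is selected when its coin is 1, coins are drawn on the fly instead of being read from a precomputed array, and the loop over selected elements stands in for the parallel threads.

```python
import random


def no_cut_rank(succ, weight, seed=0):
    """Serial simulation of No-Cut list ranking (succ[i] == -1 marks the end)."""
    rng = random.Random(seed)
    n = len(succ)
    succ = list(succ)
    # to_next[x]: distance from x to succ[x], or to the end of the list if x is last.
    to_next = [weight[i] if succ[i] != -1 else 0 for i in range(n)]
    head = (set(range(n)) - set(succ)).pop()
    active = list(range(n))
    history = []  # one list per forward iteration: (skipped, ranking parent, offset)

    while len(active) > 1:
        coin = {x: rng.randint(0, 1) for x in active}
        # Assumption: the head is always selected so that it survives every iteration.
        sel = {x for x in active
               if x == head
               or (coin[x] == 1 and (succ[x] == -1 or coin[succ[x]] == 0))}
        skipped = []
        for x in sel:  # each selected element would be one thread
            dist, y = to_next[x], succ[x]
            while y != -1 and y not in sel:  # pointer-jump over the chain
                skipped.append((y, x, dist))
                dist += to_next[y]
                y = succ[y]
            succ[x], to_next[x] = y, dist
        history.append(skipped)
        active = [x for x in active if x in sel]

    rank = [0] * n
    rank[head] = to_next[head]
    # Play the iterations backwards: ranking parents are ranked before children.
    for skipped in reversed(history):
        for y, parent, offset in skipped:
            rank[y] = rank[parent] - offset
    return rank
```

Each thread only reads the pointers of non-selected elements and only writes its own selected element, so the order in which the threads are simulated does not affect the result.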

Time-Emphasized Parallel Randomized Coin-Tossing Algorithm (Cut-6) A variant
of the parallel algorithm of the previous paragraph is presented next. Dubbed
“Cut-6” , it achieves better time but does more work. That is, it runs faster than 
the No-Cut algorithm if available (machine) parallelism is above some threshold. 
The only difference relative to the No-Cut algorithm is that in the forward it- 
erations a selected element stops pointer-jumping after 6 elements even if it has 
not reached another selected element. Forward iterations also find the rank of 
the head of the list and playing the iterations “backwards” extends the ranking 
to all elements of the list. 

Wyllie’s Parallel Algorithm consists of $\log_2 N$ steps, each comprising parallel
pointer jumping by all elements, until all elements are ranked.
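
A serial Python simulation of Wyllie's pointer jumping is sketched below. The array representation (`succ`, `weight`, with -1 marking the end of the list) is an illustrative assumption; each loop iteration builds new arrays from the old ones to mimic one synchronous parallel step.

```python
import math


def wyllie_rank(succ, weight):
    """Rank a list by repeated pointer jumping (succ[i] == -1 marks the end)."""
    n = len(succ)
    nxt = list(succ)
    # rank[i]: distance covered by the current pointer of i.
    rank = [weight[i] if succ[i] != -1 else 0 for i in range(n)]
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # All elements read the old arrays and write new ones, as in one parallel step.
        new_rank = [rank[i] + (rank[nxt[i]] if nxt[i] != -1 else 0) for i in range(n)]
        new_nxt = [nxt[nxt[i]] if nxt[i] != -1 else -1 for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank


# The list 3 -> 0 -> 2 -> 1 with pointer weights 5, 1, 1.
ranks = wyllie_rank([2, -1, 1, 0], [1, 0, 1, 5])
```

Every element performs a jump in every step, which is why the work grows as $N\log_2 N$ even though the number of steps is only logarithmic.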



3 Optimization and Analysis of the Algorithms 



The Level of Parallelism, denoted P, is (for an ILP architecture) the number of
instructions whose execution can overlap in time. P cannot exceed the product
of the number of instructions that can be issued in one clock with the number
of stages in the execution pipeline. (The popular text [HP96] suggests on p. 430
that multiple-issue of 64 is realizable in the time frame of the years 2000-2004;
so, a standard 5- or 6-stage pipeline leads to P exceeding 300. It is hard to
guess why P has hardly increased since 1995; perhaps this is because vendors
have not found a way to harness such an increase for cost-effective performance
improvement of existing code.)

Our building blocks for parallel list ranking are several parallel algorithms,
each of which is non-dominated, i.e., for each of these algorithms there is some
input size and some P value for which it is the fastest. [DV99] describes low- and
high-level considerations that were involved in getting the best performance out
of each algorithm, and in assembling them in line with the accelerating cascades
technique. This section discusses particularly noteworthy issues
related to work and time analysis of the algorithms.



Serial Algorithm 

Loop unrolling is the main concern in analyzing the execution of the serial al- 
gorithm. Issues such as the size of the code, the limited number of registers R, 
and the overhead of leaving too large a “modulo” from the loop, are considered. 
Most limiting was the value of R, and [DV99] discusses a choice of R = 128.
Parallel Randomized Coin-Tossing Algorithm (No-Cut)

Two issues require most attention: (i) the convergence rate of the forward it- 
erations, and (ii) bounding the length of the longest chain from one selected 
element to its successive selected element. The No-Cut algorithm is optimized 
for less work. It has smaller work constants than the time-emphasized Cut-6 
algorithm, but its asymptotic time bound includes an $O(\log_2^2 N)$ term. The
algorithm becomes dominant (in the term $W/P + T$, a standard way for evaluating
PRAM algorithms) for relatively small values of P. The work term is $O(N)$ and
its constant is very important. But why does the “No-Cut” algorithm perform
less work? The answer is: a significant decrease in the number of (forward and 
backward) iterations. We have longer threads but fewer of them in fewer iter- 
ations, so on balance the overall work term is smaller. (An optimization not 
discussed here is for reducing the number of spawn-join pairs.) 

Convergence of Iterations The expected fraction of skipped elements in each
iteration is $\frac{3}{4}$. Let $N$ be the initial input size, and let $M_i$ be
the expected number of active elements in the $i$-th iteration (i.e., the input
size for the $i$-th iteration); we make the simplifying assumption that the
number of active elements in each iteration will be: $M_0 = N$,
$M_1 \le \frac{1}{4}N$, $\ldots$, $M_i \le (\frac{1}{4})^i N$. So,
$\sum_i M_i \le 1.34N$. The number of forward iterations $\#iter$ will not
exceed $\log_4 N = 0.5\log_2 N$.

According to the simulations for various sizes of inputs, $\sum_i M_i \le 1.33N$
and $\#iter$ stayed below the $0.5\log_2 N$ bound, both fully supporting our
analysis.
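
The quarter-selection rate and the resulting geometric bound can be illustrated with a small Python experiment. This is a sketch under simplifying assumptions: elements are taken in list order (so the successor of position `i` is position `i + 1`), the last element counts as selected when its coin is 1, and all names are introduced here for illustration only.

```python
import math
import random


def selected_flags(coins):
    """An element is selected if its coin is 1 and its successor's coin is 0."""
    n = len(coins)
    return [coins[i] == 1 and (i + 1 == n or coins[i + 1] == 0) for i in range(n)]


rng = random.Random(1)
coins = [rng.randint(0, 1) for _ in range(200_000)]
flags = selected_flags(coins)

fraction_selected = sum(flags) / len(flags)            # close to 1/4
work_bound = sum(0.25 ** i for i in range(60))         # sum of M_i / N, tends to 4/3
iterations_per_log2n = 1 / math.log2(4)                # log_4 N = 0.5 * log_2 N

print(fraction_selected, work_bound, iterations_per_log2n)
```

The geometric sum approaches 4/3, which is consistent with the 1.34N bound stated above.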

Bounding the Length of the Longest Chain In order to analyze the execution 
time for the No-Cut algorithm we need to find a realistic bound on the longest
chain. This bound is important since the critical-time-path length of each forward 
iteration is determined by the “slowest” thread, i.e., the thread that has to short- 
cut over the longest chain. 

Some definitions follow: (i) A “chain” is a series of consecutive (based on
pointers) elements in a list starting at “1,0” (coin-tossing selection results)
and ending just before the next “1,0” (coin-tossing selection results). (ii) A
chain’s “length” is the number of elements after the initial “1” (i.e., the
elements which will actually be short-cut). (iii) Given an element,
$P(length \ge i)$ is the probability that the length of a chain starting at it
is $i$ or longer.

Claim 1: $P(length \ge \log N + \log\log N) \le \frac{1}{N}$.

Verbally, this can be described as: the probability for chains whose length is
longer than $\log N + \log\log N$ is upper bounded by $\frac{1}{N}$.

To prove Claim 1, we find an $i$ such that $P(length \ge i) \le \frac{1}{N}$.
The probability of the length of a chain will be determined as follows: (i) A
chain will be of length $\ge j$ if $\forall i: 1 \le i \le j$, $next^i(x)$ is
not selected, where: (1) $x$ is the selected element that starts the chain, and
is considered to be the zero element of the chain. (2) $next^i(x)$ is the
$i$-th successor of element $x$. And, (3) an element is selected only if its
coin is “1” and its successor’s coin is “0”. (ii) $\frac{1}{4}$ of the
$next^2(x)$ elements will be selected (when the coin of $next^2(x)$ is 1 and
the coin of $next^3(x)$ is 0). $\frac{2}{8}$ of the $next^3(x)$ elements will be
selected, since only 2 of the 8 coin combinations for $next^2(x)$, $next^3(x)$
and $next^4(x)$ make $next^3(x)$ the first selected successor. (iii) So the
series of selected elements will be:
$0, \frac{1}{4}, \frac{2}{8}, \frac{3}{16}, \ldots, \frac{i}{2^{i+1}}, \ldots$.
The situation is depicted in the accompanying figure. For the algorithm this
means the following: the fraction of threads that will stop after 1 short-cut is
$\frac{1}{4}$, after 2 short-cuts is $\frac{2}{8}$, after 3 short-cuts is
$\frac{3}{16}$, and so on.

Claim 1 follows from $P(i) = \frac{i}{2^{i+1}} \le \frac{1}{N}$ for
$i = \log N + \log\log N$.
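
The series used in the proof can be checked by exhaustive enumeration of coin outcomes. The Python sketch below is an illustration only; `chain_length_probability` is a hypothetical helper, and it assumes independent fair coins with the first successor's coin fixed to 0 (because the chain starts at a selected element).

```python
from fractions import Fraction
from itertools import product


def chain_length_probability(k):
    """Exact P(chain length == k), enumerating the coins of next^2 .. next^(k+2)."""
    hits = 0
    for coins in product((0, 1), repeat=k + 1):
        c = (0,) + coins  # c[j] is the coin of next^(j+1); next^1 always has coin 0
        # Index of the first selected successor: coin 1 followed by coin 0.
        first = next((j for j in range(1, k + 1) if c[j] == 1 and c[j + 1] == 0), None)
        hits += first == k  # length k means next^(k+1) is the first selected one
    return Fraction(hits, 2 ** (k + 1))


probabilities = [chain_length_probability(k) for k in range(1, 8)]
```

Under these assumptions the enumeration reproduces the terms $\frac{k}{2^{k+1}}$, that is 1/4, 2/8, 3/16 and so on.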

[DV99] reports several simulations to support our formal analysis of the
longest chain bounds. The simulations resulted in the following: (i) We ran over
200 different iV-values in the range: 10^ < N < lO’’. (ii) Maxlen < 1.2S*log2N; 
Maxlen denotes the length of the longest chain, (iii) Only 6 simulations resulted 
in Maxlen > 1.2Q*log2N. (iv) Practical conclusion: E{Maxlen) = 1.2 *log 2 N. 



The Cut-6 Algorithm For analyzing the Cut-6 algorithm we consider the
sequence which describes the convergence of the forward iterations. Recall that
during the forward iterations, a selected element stops pointer-jumping after 6
elements even if it has not reached another selected element. An analysis which
is similar to the proof of Claim 1 applies. The expected fraction of skipped
elements in each iteration will be
$\frac{1}{4}\left[\frac{1}{4} + 2\cdot\frac{2}{8} + 3\cdot\frac{3}{16} + 4\cdot\frac{4}{32} + 5\cdot\frac{5}{64} + 6\cdot\frac{7}{64}\right] = \frac{183}{256}$.
Let $N$ be the initial input size, and let $M_i$ be the number of active
elements in the $i$-th iteration. Then, under similar assumptions to the
analysis of the No-Cut algorithm, the number of active elements in each
iteration will be: $M_0 = N$, $M_1 \le \frac{73}{256}N$, $\ldots$,
$M_i \le (\frac{73}{256})^i N$ and $\sum_i M_i \le \frac{256}{183}N \le 1.4N$.
The number of iterations necessary for the algorithm to finish is
$\log_{256/73} N = 0.552\log_2 N$.
Why, among the family of cut-$k$ algorithms, did we choose the one that stops
after short-cutting at most 6 elements? We experimented with
$k = 3, 4, 5, 6, \ldots, 11$. The minimum time constant, for the $O(\log_2 N)$
term, was obtained for $k = 6$. The work constant, for the $O(N)$ term,
continued decreasing until $k = 10$, but wasn’t significantly smaller than for
$k = 6$. Thus we chose $k$ to be 6.
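
The constants in the Cut-6 analysis can be re-derived with exact rational arithmetic. The Python sketch below assumes the chain-length distribution $\frac{j}{2^{j+1}}$ from the proof of Claim 1 and a selection rate of one quarter; the function names are illustrative and not taken from the paper.

```python
import math
from fractions import Fraction


def chain_probability(j):
    """P(chain length == j) under the coin-tossing selection rule."""
    return Fraction(j, 2 ** (j + 1))


def cut_k_skipped_fraction(k):
    """Expected fraction of elements skipped per iteration when jumps stop after k."""
    tail = 1 - sum(chain_probability(j) for j in range(1, k))  # P(length >= k)
    expected_jumps = sum(j * chain_probability(j) for j in range(1, k)) + k * tail
    return Fraction(1, 4) * expected_jumps  # a quarter of the elements are selected


skipped = cut_k_skipped_fraction(6)                            # 183/256
active = 1 - skipped                                           # 73/256
work_factor = 1 / (1 - active)                                 # 256/183, below 1.4
iterations_per_log2n = 1 / math.log2(float(1 / active))        # about 0.552

print(skipped, active, float(work_factor), round(iterations_per_log2n, 3))
```

Calling `cut_k_skipped_fraction` with other values of `k` gives the corresponding convergence rates for the rest of the cut-$k$ family.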



Wyllie’s Parallel Algorithm is relatively simple. Most of its work and time
analysis is clear from the code and the counts presented. The only place where
additional assumptions were needed concerns the calculation of $\log_2 N$ (for
the number of iterations). As discussed in [DV99], the $\log_2 N$ calculation
adds a large constant to the work term.



General comment on compilation feasibility Translation from the high-level code
(C and extended-C) to the optimized assembly code, which has actually been
done manually, is feasible using known compiler techniques.



4 The Instruction Set 

The parallel algorithms presented in this work all use an asynchronous mode 
of the so-called Spawn-MT programming model for the XMT framework; this 
allows expressing the fine-grained parallelism in the algorithms. Concretely, we 
extended standard instruction sets (for assembly languages) to enable transi- 
tion back and forth from serial state to parallel state and allow in the parallel 
state any number of (virtual) threads. A Spawn command marks a transition 
from a serial state to parallel state and a Join command marks a transition in 
the opposite direction (see the left side of the accompanying figure). The
semantics of the assembly language, called the “independence of order semantics
(IOS)”, allows each
thread to progress at its own speed from its initiating Spawn command to its 
terminating Join command, without having to ever wait for other threads. Syn- 
chronization occurs only at the Join command; all threads must terminate at 
the Join command before the execution can proceed. 

The assembly language includes multi-operand instructions, mainly the
prefix-sum (PS) instruction, whose specification is similar to the known
Fetch-and-Add. The paper [VDBN98] explains why, if the number of operands does
not exceed k, for some parameter k which depends on the hardware, and each of
the elements is, say, one bit, multiple-PS can be implemented (in hardware) very
efficiently (in “unit time”), even where the base is any integer. The high-level
language for Spawn-MT (as well as the assembly language) is single-program 
multiple-data (SPMD). It is an extension of standard C which includes Spawn 
and multioperand instructions such as prefix-sum. Using standard C scoping, the 
Joins are implicit. A Join implementation comprises a parallel sum operation for 
monitoring the number of terminating threads. 

The XMT instruction set extends the standard serial MIPS instruction set.
The table below describes a few new instructions. A fuller specification
for the explicit parallel instruction set (also referred to as the spawn-join
instruction set) can be found in the extended summary version of [VDBN98]; it
describes the new instructions and how they can be efficiently implemented in
hardware.
A Few XMT Instructions

Notation: R1$ is local register number 1 of thread $.

Instruction          Instruction Meaning

PS Ri, Rj            Atomically: (i) Ri := Ri + Rj, and (ii) Rj := (original) Ri.
                     Ri is called the base for PS.

PS Ri, Rj_f          Multiple prefix-sum. If a sequence of g prefix-sum commands
for 1 <= f <= g      comes from one thread: follows serial semantics. If the g
                     commands come from different threads: follow some
                     (arbitrary) serial semantics. Executed by special hardware.
                     Assumptions made: if each Rj_f is between 0 and 3, then
                     unit-time execution if g <= k (base is any integer).

PSM 15(Ri), Rj       Prefix sum to memory, with base Memory[15 + Ri].
                     Multiple PSM: only inter-thread.

Mark Ri, Rj          If Ri = 0, then atomically: Ri = 1 and Rj = 0; else
                     Rj = 1. Also: Multiple-Mark executed by special hardware.

MarkM 15(Ri), Rj     Mark to memory. Also, multiple MarkM.

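
The PS semantics, and its use for compacting selected elements into a smaller array, can be mimicked in Python. This is a behavioral sketch only: `ps` and `compact` are names introduced here, registers are modelled as plain integers, and the serial loop stands for one of the arbitrary serial orders that the multiple-PS semantics allows.

```python
def ps(base, increment):
    """PS Ri, Rj: return (new Ri, new Rj) = (Ri + Rj, original Ri)."""
    return base + increment, base


def compact(flags):
    """Assign each flagged thread a distinct slot in a smaller array."""
    base = 0
    slots = {}
    for thread_id, flag in enumerate(flags):  # any serial order would be legal
        if flag:
            # The thread contributes 1 and receives the old base as its slot.
            base, slots[thread_id] = ps(base, 1)
    return base, slots


count, slots = compact([1, 0, 1, 1])
```

After the loop, `count` is the size of the smaller array and every flagged thread holds a unique index into it.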



5 Execution Model and Execution Times 

We are interested in comparing performance results for the serial algorithm in a 
serial (ILP) execution model with the parallel algorithms in our new execution 
model. The performance results we obtained are derived by finding the execution 
critical-time-path T and counting the work W from the assembly code written 
for each of the algorithms in their respective instruction sets. 
Following is an overview of the process by which we obtained the performance
results.

(i) Given the code, we tried to place each instruction as early as possible without 
violating data and control dependences. 

(ii) Speculation. Generally, we were able to obtain our results without resorting 
to speculation; however, we allowed speculation for branches which are known to 
happen with high probability (e.g., branching on whether an element of a linked 
list is last in the list fails for all but one of the elements of the list). 

(iii) Cache misses. We examined every memory access. If, due to spatial-locality 
or temporal-locality, the address could be found in the cache, we assumed unit- 
time memory access. If a case for spatial or temporal locality could not be made, 
we applied a penalty of d = 5 time units (to simplify our analysis). 

This is unquestionably an arbitrary decision which oversimplifies an involved sit- 
uation. The issue is how to account for memory response time without resorting 
to detailed memory system simulations; the problem is that to reach the level of 
detail that merits the use of simulations, we will have to specify the hardware 
in ways which exceed our more general bridging model and our basic intent to 
reserve judgement on hardware implementation issues which complement our 
new ideas. We could only say that picking somewhat bigger or smaller numbers 
for d did not appear to change our overall comparison of serial and parallel code 
significantly. For problems which are not exceedingly large, the memory could
fit within the levels of the cache (which reached 16MB for all cache levels in
1995, with a maximum response time of 15 cycles). So, the number 5 is not
unreasonable for average response time. Also: (a) the sensitivity of our speed-up
results to replacing 5 by slightly bigger or smaller numbers was marginal; (b)
a Scheduling Lemma, presented and explained in [VDBN98], justifies the use of
a single average number for simulating memory response time.

(iv) Prefetching. To alleviate this cache-miss penalty, prefetching could begin 
prior to a branch or spawn instruction, that precedes the memory access; how- 
ever, to minimize speculative execution, their commit is deferred to after the 
resolution of the branch or spawn. 

(v) Spawn and Join instructions. Our analysis uses the following simplifying 
assumption: a Spawn instruction contributes one to the operation count, and 
a Join instruction (in each thread) is counted as one operation. Our rationale 
follows. The principal goal has been comparing the performance of two kinds of 
instruction codes: serial and parallel. We were faced with a situation where the 
extensive hardware used currently in practice in order to extract ILP from serial 
code and control its execution is not accounted for in the instruction code. This 
gives an unfair advantage to the serial code since it is likely to require more com- 
plicated hardware than for implementing our explicitly parallel code; one reason 
is that parallelism mandated by the code frees the hardware from verifying that 
no dependences occur. This stresses our case. It explains why in terms of devoted 
hardware, standard serial code is unlikely to be implemented faster, or using less 
hardware, than our envisioned parallel code (for the same level of parallelism) . 
Still, we take the extremely conservative approach of counting the Spawn and
Join instructions, while essentially not counting anything on the serial code end. 
The only case where we applied to serial code the same costs as to the parallel 
one was where loop parallelism, which can be determined at run-time, is needed. 
Note that this still reflects a “discount” for the serial code: in practice only some 
fixed loop-unrolling is possible for serial code, while our Spawn mechanism has 
no problem handling this for parallel code. 

(vi) Unit time instructions. All other instructions are assumed to take unit time. 

(vii) Scheduling Lemma, number of registers and code length. The Scheduling
Lemma (in [VDBN98]) assumes that the number of registers is unlimited and
that the length of the code can be increased with no penalty (as needed for
loop-unrolling). (However, our performance results consider limiting these
assumptions, as well.)

The detailed execution assumptions and techniques for figuring out time and
work counts are described under http://www.umiacs.umd.edu/~vishkin/XMT/
in “Explicit Multi-Threading (XMT) - Specifications and Assumptions”; they are
followed there by a simple example, to facilitate understanding before
approaching the full algorithms. The proof of the Scheduling Lemma is presented
in the full paper of [VDBN98].

A Conservative Comparison Approach We made every effort to avoid unfairly 
favoring the performance estimates of the new code over standard serial code. 
Concretely, we followed the following guidelines: 

(1) For standard serial code, we took T as the running time, since this is a lower 
bound on the execution time achievable. 

(2) For parallel code, we took $W/P + T$ as the running time, since this is an
upper bound on the execution time achievable. (This bound relies on the
Scheduling Lemma in [VDBN98].)

6 Summary of Work and Time Execution Counts and 
Speed-Up Results 

The execution times for the four algorithms were derived by counting the execu- 
tion time and work of each algorithm, following the execution assumptions and 
techniques described above. The fully commented assembly code for each of the 
algorithms can be found through the web site
http://www.umiacs.umd.edu/~vishkin/XMT/. Detailed counts are also presented
there in a table format for
each of the algorithms, and the execution work and time below are derived from 
that. (For later reference, we also included data on the RM algorithm.) 
Algorithm   Work <=                                  Time

Serial      15N + 4(N/R) + 5(N mod R) + 22           9N + 4(N/R) + i(N mod R) + 22
No-Cut      58N + 4 log2 N + 2                       4.2 (log2 N)^2 + 18.5 log2 N + 27
Cut-6       80.5N + 2.2 log2 N + 2                   U.7 log2 N + 27
Wyllie      10 N log2 N + 12N + 2 log2 N + 268       15 log2 N + 30
RM          32N + W log2 N + 9                       9 (log2 N)^2 + 27 log2 N + 6



6.1 Speed-Up Results 



We compared the execution time for the 4 algorithms (serial, No-Cut, Cut-6 and 
Wyllie’s) for various input-size (N) values and machine parallelism (P). 

We considered P values up to 2000, which cover the horizon for the amount 
of parallelism currently envisioned for explicit ILP. 

We considered N values (input size) up to 10^. 

Upper bounds on the execution time for the parallel algorithms ($W/P + T$) are
compared with a lower bound on execution time for the serial algorithm ($T$), to
comply with our conservative approach.
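
To make the $W/P + T$ comparison concrete, the Python sketch below evaluates the upper bound for two of the parallel algorithms, using the No-Cut and Wyllie work and time counts from the table in Section 6. The function names are illustrative assumptions, and the sample values of N and P were picked only for demonstration.

```python
import math


def no_cut_counts(n):
    """(work, time) counts for the No-Cut algorithm, as listed in the table."""
    lg = math.log2(n)
    return 58 * n + 4 * lg + 2, 4.2 * lg ** 2 + 18.5 * lg + 27


def wyllie_counts(n):
    """(work, time) counts for Wyllie's algorithm, as listed in the table."""
    lg = math.log2(n)
    return 10 * n * lg + 12 * n + 2 * lg + 268, 15 * lg + 30


def parallel_time_bound(counts, p):
    """Upper bound W/P + T on the running time with machine parallelism p."""
    work, time = counts
    return work / p + time


P = 2000
for n in (10 ** 3, 10 ** 6):
    print(n,
          round(parallel_time_bound(no_cut_counts(n), P)),
          round(parallel_time_bound(wyllie_counts(n), P)))
```

With these counts and P = 2000, Wyllie's algorithm gives the smaller bound at N = 1000, while No-Cut gives the smaller bound at N = 1,000,000.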

We noticed the following: (i) The serial algorithm, under our assumptions
and execution counts, is competitive for P < 7. (ii) Each of the three parallel
algorithms has an area (of P-values and N-values) for which it is the best (see
the table at the end). When looking at a specific P (along a vertical section)
we see, as expected, that:

— For larger N-values, the work-emphasized No-Cut algorithm is ahead.

— For intermediate N-values, the time-emphasized Cut-6 is ahead.

— For smaller N-values, Wyllie’s algorithm is ahead.

Plots of the areas in which each algorithm is dominant are shown in the
accompanying figure. Note that the table at the end of the paper adds data about
RM’s algorithm and shows that it is ahead for larger N-values.

Notes on the table of work and time counts:

In the serial algorithm: (i) The loop unrolling is bound by R, the number of
available registers (which affects the work and time counts). (ii) For the
comparison with the parallel algorithms we will only look at the lower bound of
the serial algorithm, i.e., the time count alone (which amounts to a critical
path in a dependence graph).

The $\log_2^2 N$ term in the time count for the “work-emphasized” No-Cut
algorithm is based on the fact that the “longest chain” (in each iteration of
the No-Cut algorithm) does not exceed $2\log_2 N$ with high probability.

The high constant in Wyllie’s algorithm’s work is due to the calculation of
$\log_2 N$ (the number of iterations) at the beginning (assuming that N is 64
bits).

In the RM algorithm: (i) We used a powerful atomic instruction to eliminate
multiple selections of the same element (and remain with one copy of each
selected element). An alternative setting, where such an atomic instruction is
not available, would imply that the time and work counts increase. (ii) The
random selection of elements is counted as one operation per iteration only.
This implies that the random numbers should be prepared ahead of time and at a
minimum scaled (which requires at least one additional operation) to the
relevant list size at each iteration. These assumptions provide wide margins to
make sure that we do not favor the new algorithms (No-Cut, Cut-6, etc.) over
RM’s algorithm. As a result the work and time counts for RM’s algorithm ended up
being a bit optimistic. See the later reference in Section 7.1.



7 Other Implementations of List-Ranking Algorithms 

In Reid-Miller’s work, list ranking is discussed. After considering several
work-efficient list ranking algorithms, they implemented a
new algorithm which has experimentally optimized smaller constants for the
parallel work, but not fast asymptotic time. The algorithm does recursive
iterations using a randomized division into $m + 1$ sublists. The
variance in the size of the sublists (which does not present a problem in our
XMT model) is amortized by giving several sublists to each processor. When
the number of sublists becomes small enough (determined experimentally) they
switch to use either the serial or Wyllie’s algorithm. Note that the code requires
a careful manual adjusting of parameters to achieve good speedup each time a
different number of processors is used. The platform used is the Cray C90 vector
multiprocessor, with 128 vector elements per processor (in Figure 2 there). As
explained there, when vectorizing a serial problem that requires gather/scatter
operations, the best speedup one can expect on a single-processor Cray C90 is
about a factor of 12-18, but when the vectorized algorithm does more work than
the serial one then one can expect smaller speedups (the vectorization is used
to hide latencies). The results presented are for 1, 2, 4 and 8 processors. The
speedups obtained for list ranking (presented in Figure 11 there) using 8 proces- 
sors (i.e., 8* 128 = 1024 vector elements) are approximately 40 for n = 3* 10^, 35 
for n = 4 * 10® and 4 for n = 8 * 10®. In Section 7.1 below, a detailed comparison
between RM’s algorithm and our No-Cut algorithm is given.



In [HR96] the implementation of three parallel list ranking algorithms on the
massively parallel SIMD (Single Instruction Multiple Data) machine MasPar
MP-1 with virtual processing, using the MPL C-based language, is presented. The
MP-1 has 16384 physical processors. The algorithms implemented were: Wyllie’s 
suboptimal deterministic algorithm, a simple optimal randomized algorithm and 
a combination of the two where initially the randomized algorithm is run and 
when the number of elements remaining was no more than the number of physical 
processors they switch to Wyllie’s algorithm. A comparison to the sequential 
algorithm on a UNIX system is presented, where the raw computational power 
of the MP-1 was at least 63 times larger than the SPARC used, and the total 
amount of main memory available for computation on the massively parallel 
computer is approximately 32 times larger. A speedup of 2 was achieved when 
all cells in the linked list fit into the SPARC’s main memory (n ~ 3 * 10®).
However, when heavy swapping was needed in the sequential implementation, 
its performance degraded dramatically. 
The platform used in two further studies is the Intel Paragon, which
has a grid-like structure. Three algorithms were implemented in one of these
studies: Wyllie’s
pointer-jumping, independent set removal and sparse ruling set. Using 100 pro- 
cessors the best speedups were respectively: 5 for n ~ 3 * 10®, 14 for n ~ 5 * 10^ 
and 27 for n ~ 2*10®. In one of these studies, the main result is an “external
algorithm” which
achieves a speedup of 25 using 100 processors for n = 10®; for an illustration of
the advantage of XMT over these multi-processing results for smaller input, see
the right side of the accompanying figure.

Several deterministic work-optimal EREW list ranking algorithms
have been implemented using Fork95 for the PAD library of PRAM algorithms
and data structures in connection with the SB-PRAM and Fork95 projects.
Fork95 is a C-based parallel programming language. All the algorithms switch at
some experimentally determined parameter to Wyllie’s algorithm. The execution 
was evaluated on a simulator for the SB-PRAM. The results reported were: with 
32 processors for n = 42968 there is a break-even between Anderson-Miller’s al- 
gorithm and the serial one; with 128 processors the best algorithm is Wyllie’s 
with speedups of approximately 3.5 for n = 42968 and 5.7 for n = 10742. Their 
practical conclusion was that none of the studied algorithms is fully satisfac- 
tory, due to large constants; although some randomized algorithm might achieve 
better results. 

7.1 Detailed Comparison to Reid-Miller’s Algorithm 

RM’s algorithm also has forward and backward iterations. A forward iteration 
works as follows. A parameter k is specified. In parallel, k random numbers each 
between 1 and n (the number of current elements in the list) are picked. Each 
element, whose index is one of these random numbers, is selected (after getting 
rid of multiple selections of the same element). For each selected element the for- 
ward iteration proceeds by pointer jumping over all non-selected elements. This 
results in linking all selected elements. All selected elements are then included 
in a smaller array. RM’s algorithm performs less work at one interesting point. 
Unlike the No-Cut and Cut-6 algorithms it need not check for each element (at 
the beginning of the forward iteration) whether it was selected. We implemented 
RM’s algorithm in the same XMT programming model. Since our main interest 
was comparison with our No-Cut algorithm, we implemented RM’s algorithm in 
assembly code for the XMT computational framework. (We left the cascading 
aside as we did for other algorithms we implemented.) 
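
The forward iteration described above can be sketched as a sequential simulation. This is only an illustration in Python, not the XMT assembly implementation: the names `forward_iteration`, `ranks_from_contraction`, `nxt`, `weight` and `parent` are hypothetical, and the sketch assumes that the head of the list is always kept among the selected elements so that every skipped element has a selected predecessor.

```python
import random

def forward_iteration(nxt, weight, head, k, rng):
    """Sequential simulation of one forward iteration (illustrative only).

    nxt[i]    -- index of the successor of element i, or None at the tail
    weight[i] -- weight of the link leaving element i
    Returns the contracted list over the selected elements and, for every
    skipped element, its ranking parent with the distance from that parent.
    """
    n = len(nxt)
    # Draw k random indices; building a set removes multiple selections.
    selected = {rng.randrange(n) for _ in range(k)}
    selected.add(head)  # assumption of this sketch: the head is always kept

    new_nxt, new_weight, parent = {}, {}, {}
    for s in selected:  # on a parallel machine: one thread per selected element
        dist = weight[s]
        j = nxt[s]
        # Pointer-jump over non-selected elements until the next selected one.
        while j is not None and j not in selected:
            parent[j] = (s, dist)
            dist += weight[j]
            j = nxt[j]
        new_nxt[s] = j        # next selected element, or None at the end
        new_weight[s] = dist  # accumulated weight of the jumped-over links
    return selected, new_nxt, new_weight, parent

def ranks_from_contraction(head, new_nxt, new_weight, parent):
    """Rank the contracted list serially, then extend to skipped elements."""
    rank, s, r = {}, head, 0
    while s is not None:
        rank[s] = r
        r += new_weight[s]
        s = new_nxt[s]
    for j, (p, dist) in parent.items():
        rank[j] = rank[p] + dist
    return rank
```

In this sketch the second function plays the role of the backward iteration: once the selected elements are ranked, every skipped element obtains its rank from its ranking parent.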

For comparison purposes, we took the number of randomly picked sublists to 
be k = n/4. Note, however, that in the No-Cut algorithm, we could have actually 
controlled the expected number of selected elements by picking a probability 
different from 1/2 for getting a Head at each node. In other words, No-Cut can 
also be flexible with the number of selected elements, similar to RM’s algorithm. 

We note several things in our implementation of RM’s algorithm and in the 
comparison to the No-Cut algorithm. First, the random selection of k elements, 
as the heads of the sublists, is presented and counted as only one operation per 
iteration in the instruction code. This implies that the random numbers should 
be prepared ahead of time and (at least) scaled (which should require at least one 
additional operation for each draw of a number in each iteration) to the relevant 
list-size at each iteration (to achieve uniform distribution for each iteration). 
Second, we used the markM atomic instruction (which is a simpler variant of the 
PSM instruction) to eliminate multiplicities among the k selected elements. This 
instruction is counted as 1 work unit and 5 time units (since it has to read a 
memory location). Had multiplicities been eliminated differently the time and 
work necessary per iteration would have been different. These assumptions were 
used as part of a conservative comparison with RM’s algorithm. We expected 
RM’s algorithm to perform less work, since only the threads that are active 
(i.e., selected) start working after the selection was completed. In the No-Cut 
implementation all threads are spawned and those that are not selected terminate 
immediately (which accounts for some of the additional work). We should 
remember that the No-Cut algorithm was time-optimized to compete with the 
Cut-6 algorithm, thus putting most of the effort into decreasing the critical-time- 
path even when that results in additional work. This included, amongst other 
things, using an “active-array” and copying into it most of the information in 
the elements, in order to save a few cache misses (when accessing data through 
two pointer accesses, which must be done in RM’s algorithm). When optimizing 
for less work a different approach should be used for a better comparison. In 
such a case, we could save roughly 15 * N work units by using a selected-list 
(similar to the one used in RM’s algorithm) at the end of the selection step 
(coin-tossing), spawning only the selected elements and not storing duplicates of 
most of the information in an active-array. This, of course, would increase the 
critical-time-path (by at least $5 \log_2 N$), and both algorithms would be much 
more similar. 



The discussion above explains how our performance results made RM’s algo- 
rithm look much better than other algorithms. This was done since the compar- 
ison was against RM’s algorithm. The results of the analysis show that the work 
term for the RM algorithm (32IV -|- lQlog 2 N -|- 9) is smaller than the one for the 
No-Cut algorithm (see the results in sectionQ , but the time term for the No-Cut 
algorithm is smaller than the one for RM’s algorithm {^log 2 N^ +21log2N +&) . For 
smaller list sizes the No-Cut algorithm performs better. A few of the speedups 
relative to the serial algorithm, obtained for RM’s list ranking and for the 
No-Cut algorithm on the XMT, are compared with those on the Cray-C90 in the 
following table. 



Input size n    RM on Cray-C90         RM on XMT        No-Cut on XMT 
                P = 8 * 128 = 1024     P = 1000         P = 1000 

3.5 * 10^       speedup ~ 40           speedup ~ 280    speedup ~ 155 
4.2 * 10“       speedup ~ 35           speedup ~ 270    speedup ~ 154 
8 * 10“         speedup ~ 4            speedup ~ 34     speedup ~ 51 

Due to the new mechanism for enhancing thread spawning using PS, 
XMT circumvents the main hurdle for implementing RM’s algorithm 
on the Cray, which is the need to group several sublists for amortizing 
unequal distributions of length. RM’s solution used larger input length to 
overcome this hurdle. However, this, in turn, increased the input length for which 
good speed-ups are obtained. For XMT one can get good speed-ups already from 
shorter inputs. We should also remember that we favored the RM implementa- 
tion on XMT for the comparison with No-Cut; this in turn somewhat biases 
its comparison to the implementation on the Cray. Detailed speedup results are 
presented in lUVOII . The analysis and simulations presented there are summa- 
rized below. The most interesting conclusion in this analysis is that the expected 
length of the longest chain in RM’s algorithm is $\approx 2.75 \log_2 N$, while for 
the No-Cut algorithm the derived result is $E(\mathrm{Maxlen}) \approx 1.2 \log_2 N$. This 
highlights an interesting advantage of the No-Cut algorithm; this advantage is 
due to its different randomized mechanism for picking “selected” elements. This 
explains why the time term for RM is larger than for No-Cut and 
why No-Cut is ahead for certain input sizes and values of P. 

A few more details on the analysis of the RM algorithm. A “paper and pencil” 
analysis of the behavior of RM’s algorithm was done, picking k = n/4 elements 
randomly at each iteration as heads of the sub-lists. Also, results obtained from 
simulations of the algorithm’s behavior using the same k are reported. 

We wish to find the probability that an element is selected; i.e., picked at 
least once during the k = n/4 draws. The probability that a specific element is 
not selected at all during all the draws, denoted as $P_{ns}$, is: 

$$P_{ns} = \left(1 - \frac{1}{n}\right)^{n/4} \approx \left(\frac{1}{e}\right)^{1/4} = 0.779$$ 

Notice that $(1 - \frac{1}{n})^n$ approaches $\frac{1}{e}$ very quickly (it is already very close, 
0.366 against 0.3679, for n = 100). The probability that an element is selected 
is therefore (1 - 0.779) = 0.221. 
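
The two probabilities quoted above can be checked numerically. The short sketch below assumes k = n/4 draws, the value consistent with the figure 0.779; the function name `prob_not_selected` and the choice n = 1,000,000 are illustrative only.

```python
import math

def prob_not_selected(n, k):
    """Probability that a fixed element is missed by k uniform draws from n."""
    return (1.0 - 1.0 / n) ** k

n = 1_000_000
p_ns = prob_not_selected(n, n // 4)  # assumes k = n/4 draws
p_sel = 1.0 - p_ns

print(f"P_ns = {p_ns:.3f}, P_selected = {p_sel:.3f}, e^(-1/4) = {math.exp(-0.25):.3f}")
print(f"(1 - 1/100)^100 = {prob_not_selected(100, 100):.4f}, 1/e = {1 / math.e:.4f}")
```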

This led to an analytical bound of $2.77 \log_2 N$ on the length of the longest 
chain. This was backed by experimental results whose practical conclusion was 
that $E(\mathrm{Maxlen}) \approx 2.75 \log_2 N$. 

Fig. 1. Upper figure: original linked list. - head of linked list. $W_i$ - weight of 
a link. Elements numbering - their order in the linked list. Middle figure: input 
array representation of the linked list. Element numbering is in round brackets; 
pointer to array index of the next element in square brackets. Selected elements 
(H followed by T) are shaded. Lower figure: the linked list after one forward 
iteration. Backwards arrow to ranking parent. 

Fig. 2. Probabilities of selected elements and of chain length 




Fig. 3. Left figure: Parallel states from a Spawn to a Join and serial states. Right 
figure: Comparison to [Si97b]. 



N size \ P   10     20     30     40     50     60     80     100    200    500    1000   1500   2000 

10^          1.44   2.47   3.31   3.98   4.54   5.00   5.73   6.28   7.78   9.08   9.62   9.81   9.91 
             alg 2  alg 4  alg 4  alg 4  alg 4  alg 4  alg 4  alg 4  alg 4  alg 4  alg 4  alg 4  alg 4 

1.47   2.68   3.69   4.55   5.29   5.93   6.99 
alg 2  alg 2  alg 2  alg 2  alg 2  alg 4 
31.11  77.67  154.98  231.93 
alg 2  alg 2  alg 2  alg 2  alg 2  alg 2  alg 2  alg 2  alg 2  alg 2 


Table Above - Best Speedups Without RM’s Algorithm. These are the 
best speedups obtained when comparing the upper bound on the parallel 
algorithms (not including RM’s) with the lower bound of the serial algorithm. The 
algorithms used are: alg 1 - serial algorithm; alg 2 - work optimized coin tossing 
algorithm (No-Cut); alg 3 - time emphasized coin tossing algorithm (Cut-6); alg 
4 - Wyllie’s (pointer jumping) algorithm. Horizontal and vertical lines indicate a 
different algorithm on each side of the line. 

Table Below - Best Speedups With RM’s Algorithm. 
Recall that the RM results (alg 5 below) are a bit optimistic, as explained in 
the paper. 

Fig. 4. The areas in which each of the four algorithms for list-ranking is 
dominant. Left figure: Parallelism up to 50. Right figure: Parallelism up to 10,000. 

Abstract. Given a graph G and a positive integer k, we want to find k 
spanning trees on G, not necessarily disjoint, of minimum total weight, 
such that the weight of each edge is subject to a penalty function if it 
belongs to more than one tree. We present a polynomial time algorithm 
for this problem; the algorithm’s complexity is quadratic in k. We also 
present two heuristics with complexity linear in k. In an experimental 
study we show that these heuristics are much faster than the exact 
algorithm also in practice, and that their solutions are within around 1% of optimal 
for small values of k and much better for large k. 



1 Introduction 

Let G = (V, E) be an undirected, weighted graph with n vertices and m edges. 
Let k be a positive integer. The edge weight function is denoted by w, and we 
assume all weights are positive integers. We extend the edge weight function to 
include an integer parameter i, $0 \le i \le k$. We call $w_p(e, i)$ the penalized weight 
of edge e for a given value of i; we call the parameter i the usage of edge e. 
By definition, $w_p(e, 0) = 0$ and $w_p(e, 1) = w(e)$; and $w_p$ is nondecreasing with 
respect to i. We want to solve the following problem on G: 

Find k spanning trees $T_1, T_2, \ldots, T_k$ of G, not necessarily disjoint, such 
that the usage of edge e is the number of trees that contain e and such 
that the sum of the penalized weights of all edges is minimum. 

In other words, we want to find k spanning trees of minimum total cost such 
that we pay a penalty every time we include an edge in more than one tree. If 
$w_p(e, i) = \infty$ for all i > 1, this is the problem of finding k disjoint spanning 
trees of minimum total cost. For future reference, we call the general problem 
the minimum congestion k-spanning trees problem (kMSTc), since the penalty 
function can be used to model congestion situations. 

To our knowledge the general kMSTc problem has not been studied before. 
However, disjoint-trees versions of it are well-known. Nash-Williams 
and Tutte have studied the unweighted case. Roskind and Tarjan, building 
on work of Edmonds, presented a polynomial-time algorithm for the 
weighted case. 

The kMSTc problem is interesting in its own right. However, our main interest 
in this paper is to explore it as an example of the following type of situation: 
a problem for which a polynomial-time algorithm exists, but for which, in practice, 
it is better to use heuristics. Heuristics are commonly encountered and 
widely studied in the realm of NP-complete problems. In the case of problems in 
class P, such results are much rarer. 



Table 1. Summary of theoretical results 

algorithm    running time 
exact        $O(m \log m + k^2 n^2)$ 
A-Prim       $O(k(m + n \log n))$ 
A-Kruskal    $O(m \log m + k(m\,\alpha(2n, n) + n \log n))$ 
B            $O(m \log m + kn(\log m + \alpha(2n, n) \log n))$ 



The results we present are as follows. First we show that the kMSTc problem 
can be solved exactly in polynomial time. Then we present two algorithms for 
which we have no guarantees on solution quality, and hence are heuristics. We 
show that their theoretical running times are better than the running time of the 
exact algorithm. A summary of theoretical results is shown in Table 1. We then 
present an experimental study that shows that the heuristics are indeed much 
faster than the exact algorithm (for the graph instances tested), while finding 
solutions very close to the optimum. 



2 An Exact Algorithm 

The exact algorithm is a simple reduction of our problem to the weighted 
disjoint-trees problem (which we call the kMSTd problem). Our algorithm therefore is 
entirely based on Roskind and Tarjan’s algorithm (henceforth called RT). 
Our reduction assumes that the RT algorithm can handle parallel edges. This is 
an easy extension of the algorithm described in [h]. 

Here is how the reduction works. First note that a given edge e appearing 
in i trees of a given set contributes a total of $i\,w_p(e, i)$ to that set’s weight; and 
that the total cost of a solution is $\sum_{e} i_e\, w_p(e, i_e)$, where $i_e$ is the usage of e. 

We can therefore define the incremental cost c(e, i + 1) of edge e appearing in 
one more tree as 

$$c(e, i + 1) = (i + 1)\, w_p(e, i + 1) - i\, w_p(e, i). \qquad (1)$$ 

The value of c(e, i + 1) represents the increase in the solution’s value by the 
inclusion of edge e in the (i + 1)-th tree. 
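
As a small worked example of the incremental cost, the sketch below evaluates it for one edge under a linear penalty, and the incremental costs add up to the total contribution of the edge. The helper `incremental_cost`, the edge label `"ab"` and its weight 3 are hypothetical and chosen only for illustration; the index is shifted so that `incremental_cost(wp, e, i)` is the cost of using e in an i-th tree.

```python
def incremental_cost(wp, e, i):
    """c(e, i): extra cost of using edge e in an i-th tree (i >= 1)."""
    return i * wp(e, i) - (i - 1) * wp(e, i - 1)

# Hypothetical instance: one edge of weight 3 with penalty wp(e, i) = i * w(e).
w = {"ab": 3}

def wp(e, i):
    return i * w[e]

costs = [incremental_cost(wp, "ab", i) for i in range(1, 5)]

# The incremental costs telescope: their sum up to i equals i * wp(e, i).
totals = [sum(costs[:i]) for i in range(1, 5)]
```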

We will now reduce an instance I of the kMSTc problem to an instance I' of 
the kMSTd problem. Given an instance of the kMSTc problem (defined above) 
create a graph G' = (V, E') in the following way: for each edge $e = (u, v) \in E$, 
create k parallel edges $e_1, e_2, \ldots, e_k \in E'$ (all of them between vertices u and v) 
such that $w'(e_i) = c(e, i)$, $1 \le i \le k$. Note that if G is connected, then G' will 
certainly contain at least k disjoint spanning trees. We remark that using parallel 
edges with incremental costs is a standard way of making such reductions; see 
for example Q section 14.3]. 
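
The construction of G' can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: edges are given as `(u, v, w)` triples, the penalty is passed as a function `wp(w, i)` of the base weight and the usage, and the name `expand_to_parallel_edges` is hypothetical.

```python
def expand_to_parallel_edges(edges, k, wp):
    """Build the edge list of G': k parallel copies of every edge of G.

    edges -- list of (u, v, w) triples of the original graph
    wp    -- penalty function wp(w, i), assumed to satisfy wp(w, 0) = 0
    Returns a list of (u, v, copy_index, cost) with cost = c(e, copy_index).
    """
    expanded = []
    for (u, v, w) in edges:
        for i in range(1, k + 1):
            cost = i * wp(w, i) - (i - 1) * wp(w, i - 1)
            expanded.append((u, v, i, cost))
    return expanded

# Hypothetical triangle with a linear penalty wp(w, i) = i * w.
triangle = [(0, 1, 2), (1, 2, 5), (0, 2, 1)]
g_prime = expand_to_parallel_edges(triangle, 3, lambda w, i: i * w)
```

The explicit list has km entries; the text that follows explains why the copies can instead be generated on the fly.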

We now claim that given a solution S' for I' we can obtain a solution S for 
I and vice versa. First note that given a pair of vertices (u, v) and the edges 
$e_1, e_2, \ldots, e_k$ between them in G', an edge $e_i$ can belong to S' if and only if every 
edge $e_j$, $1 \le j < i$, is also part of S' [to simplify the discussion, we assume c(e, i) 
is strictly increasing w.r.t. i]. Furthermore, all these edges belong to different 
trees (because the trees must be disjoint and to avoid cycles). Based on this we 
can establish a one-to-one correspondence between a tree in S' and a tree in S. 
If in S', for a given pair (u, v), all edges up to $e_l$ are used, this means that the 
usage of edge (u, v) in S will be l (and vice versa). The total cost of these edges 
in S or in S' is the same and that proves the claim. 

We now very briefly describe the RT algorithm. It starts by sorting the edges 
in nondecreasing order by weight and creating a set F of k forests $F_1, F_2, \ldots, F_k$, 
each with n vertices and no edges. The algorithm executes the following 
augmenting step for each edge $e \in G$, in order: find an $F_i$ such that e can be inserted 
in $F_i$ without creating cycles (with the insertion possibly causing a rearrangement 
of edges in F). The algorithm stops when all forests become trees. The 
search for a valid $F_i$ in each augmenting step is done by a labeling process. 

Roskind and Tarjan show that this algorithm takes $O(m \log m + k^2 n^2)$ time. 
A direct application of the reduction above gives for our algorithm a running 
time of $O(km \log km + k^2 n^2)$ and a storage requirement of km edges. We now 
show that it is not necessary to store that many edges and that our algorithm 
can be made to run in the same time bound as the RT algorithm. 

Given the input graph G and its m edges, the reduction gives a clear rule 
for determining all km edges of G'. This means that we may keep G' implicit, 
and generate each copy of an edge from G on the fly. In addition, note that as 
soon as the algorithm decides to discard an edge $e_i$, all other unexamined edges 
$e_j$, j > i, between the same pair of vertices can be discarded as well. This means 
that we don’t even have to sort km values; we either discard an edge and all 
remaining parallel edges, or we use edge $e_i$ and insert its next “copy” ($e_{i+1}$) 
into the data structure that contains the edges with the correct weight. The 
time bound depends on the data structure used to represent the sorted edge list. 
Since this list must also be updated, we need a priority queue. We assume that 
a binary heap is used. 

Initialization of the algorithm consists of building the heap ($O(m \log m)$ time) 
and creating the forests ($O(kn)$). Selecting an edge from the heap costs O(1). 
Once an edge is processed, we may need to discard it or replace it with the next 
copy. In terms of heap operations, this means deleting or updating an element, 
respectively. Both require $O(\log m)$ time. 

As in the original algorithm, the labeling step is executed O(kn) times. Each 
execution requires O(kn) operations, in addition to the extra $O(\log m)$ time 
demanded by heap operations. There are up to O(m) steps in which applying 
the labeling algorithm is not necessary, since the ends of the edge being examined 
belong to the same special subset (“clump”, in RT’s terminology) of vertices. 
Each of these steps requires $O(\log m)$ time to delete the edge from the heap. The 
overall time bound of the algorithm implemented in this way is $O(m \log m + k^2 n^2)$, 
which is an improvement over the direct application of the reduction and 
is the same time bound as that of the RT algorithm. 

3 Heuristics 

In this section we present two heuristics for the kMSTc problem. These heuristics 
were developed before we knew the complexity of the problem, while we were 
still contemplating the possibility that it might be NP-complete. Even though 
it turned out that the problem is polynomially solvable, the heuristics proposed 
are intuitive, asymptotically faster than the exact algorithm, and simple to 
implement. 

3.1 Heuristic A: Tree-Greedy 

This is a greedy heuristic, in that it computes each tree $T_i$ based on all previously 
computed trees $T_j$, $1 \le j < i$. We start with graph $G_1 = G$, and compute $T_1$ 
as its minimum spanning tree. We then update the weights, obtaining graph $G_2$, 
to reflect that n - 1 edges were chosen. The updated weight of a chosen edge e 
will be $c(e, 2) = 2 w_p(e, 2) - w_p(e, 1)$, its incremental cost as defined in (1). The 
next tree, $T_2$, will be the minimum spanning tree of $G_2$ and so on, until we have 
obtained k trees. 
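
A minimal sketch of Heuristic A follows. It is the straightforward variant that re-sorts all edges with Kruskal's algorithm in every round, without any further optimization, and it is not the authors' C++ code. The names `tree_greedy` and `find`, the `(u, v, w)` edge triples and the penalty signature `wp(w, i)` are assumptions made for illustration.

```python
def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def tree_greedy(n, edges, k, wp):
    """Heuristic A sketch: k rounds of Kruskal on incrementally updated costs.

    edges -- list of (u, v, w) with vertices numbered 0..n-1
    wp    -- penalty function wp(w, i), assumed to satisfy wp(w, 0) = 0
    Returns (trees, total_cost); each tree is a list of edge indices.
    """
    usage = [0] * len(edges)
    trees = []
    for _ in range(k):
        def cost(idx):
            # Incremental cost of using this edge in one more tree.
            w, i = edges[idx][2], usage[idx]
            return (i + 1) * wp(w, i + 1) - i * wp(w, i)

        order = sorted(range(len(edges)), key=cost)
        parent = list(range(n))
        tree = []
        for idx in order:
            u, v, _ = edges[idx]
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:          # edge joins two components: keep it
                parent[ru] = rv
                tree.append(idx)
        for idx in tree:
            usage[idx] += 1
        trees.append(tree)
    total = sum(i * wp(edges[idx][2], i) for idx, i in enumerate(usage))
    return trees, total

# Hypothetical instance: unit-weight triangle, k = 2, wp(w, i) = i * w.
trees, total = tree_greedy(3, [(0, 1, 1), (1, 2, 1), (0, 2, 1)], 2,
                           lambda w, i: i * w)
```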

Any minimum spanning tree algorithm can be used to implement this heuristic. 
We analyze two possibilities: using Prim’s algorithm and Kruskal’s 
algorithm. 

Implementing Prim’s algorithm using a Fibonacci heap makes it run in 
$O(m + n \log n)$ time. As Heuristic A executes this algorithm k times, its complexity is 
$O(k(m + n \log n))$. 

Kruskal’s algorithm requires a preprocessing edge-sorting step that costs 
$O(m \log m)$. Using a union-find data structure with the usual path-compression 
and weighted-union techniques, the main loop costs $O(m\,\alpha(2n, n))$ time. The 
complexity of the algorithm is thus dominated by the complexity of the 
preprocessing step. As Heuristic A requires k applications of the algorithm, its overall 
complexity would be $O(km \log m)$. However, Heuristic A allows us to execute 
Kruskal’s algorithm more efficiently. After the i-th ($1 \le i < k$) execution of 
Kruskal’s algorithm, the set of edges can be partitioned into two subsets, one 
(the first) containing the n - 1 edges of $T_i$, and another (the second) with the 
remaining edges. The latter is already sorted, since the incremental costs of 
its edges were not changed. For the next step, all we have to do is sort the 
first subset, merge it with the second, and apply Kruskal again. This takes 
$O(m\,\alpha(2n, n) + n \log n)$ time. This observation frees us from sorting the 
complete set of edges in every step. The overall complexity of the heuristic is hence 
$O(m \log m + k(m\,\alpha(2n, n) + n \log n))$. 

3.2 Heuristic B: Edge-Greedy 

Heuristic B is greedy too, but “grows” the trees of the desired solution all 
together, in a fashion reminiscent of the exact algorithm. We start with k forests, 
each forest $F_i$ composed of the n vertices of G and no edges. We pick the edge 
with the smallest incremental cost (and hence need a priority queue) and check 
whether it can be inserted into some forest without creating a cycle. If we find 
such a (valid) forest, the edge is inserted and its incremental cost is updated 
in the queue; if a forest is not found, the edge can be discarded. This step is 
repeated until k(n - 1) edges are inserted into the forests, which by then are 
trees. There are at most m steps in which a valid forest is not found, and k(n - 1) 
steps in which an edge is actually inserted. In each step, we must perform two 
kinds of operations: 

— Edge selection: We must either remove an edge from the queue or update 
its weight. We can perform both operations in $O(\log m)$ time if we represent 
the priority queue as a binary heap. 

— Cycle checking: The most efficient way to manage cycle-related information 
is employing a union-find data structure. Since each forest may be checked 
in every step, all cycle-related operations (considering the entire execution 
of the algorithm) cost $O((kn + m)\,k\,\alpha(2n, n))$ time. 

All costs considered, the overall time bound of Heuristic B is 
$O((kn + m)(\log m + k\,\alpha(2n, n)))$. We will now show that this time bound can be substantially improved 
by using several different techniques. 
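
The basic version of Heuristic B, before any of the improvements, can be sketched as follows. The sketch uses Python's `heapq` as the binary heap and one union-find structure per forest, and it scans the forests linearly in a fixed order. The names `DSU` and `edge_greedy` and the penalty signature `wp(w, i)` are assumptions for illustration, not the authors' implementation.

```python
import heapq

class DSU:
    """Union-find over vertices 0..n-1 (path halving, no ranks)."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False      # a and b already connected: edge would close a cycle
        self.parent[ra] = rb
        return True

def edge_greedy(n, edges, k, wp):
    """Heuristic B sketch: grow k forests together, cheapest increment first."""
    forests = [DSU(n) for _ in range(k)]
    trees = [[] for _ in range(k)]
    usage = [0] * len(edges)

    def cost(idx):
        w, i = edges[idx][2], usage[idx]
        return (i + 1) * wp(w, i + 1) - i * wp(w, i)

    heap = [(cost(idx), idx) for idx in range(len(edges))]
    heapq.heapify(heap)
    inserted = 0
    while heap and inserted < k * (n - 1):
        _, idx = heapq.heappop(heap)
        u, v, _ = edges[idx]
        for f in range(k):                 # first valid forest in fixed order
            if forests[f].union(u, v):
                trees[f].append(idx)
                usage[idx] += 1
                inserted += 1
                if usage[idx] < k:         # re-insert with its updated cost
                    heapq.heappush(heap, (cost(idx), idx))
                break
        # If no forest accepted the edge, it stays out of the queue (discarded).
    total = sum(i * wp(edges[idx][2], i) for idx, i in enumerate(usage))
    return trees, total

# Hypothetical instance: unit-weight triangle, k = 2, wp(w, i) = i * w.
b_trees, b_total = edge_greedy(3, [(0, 1, 1), (1, 2, 1), (0, 2, 1)], 2,
                               lambda w, i: i * w)
```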

Monitoring Clusters. The first idea is to use an extra union-find structure (in 
addition to the ones associated with forests) to represent clusters. Clusters are 
subsets of vertices that are in the same connected component in every forest Fi. 

We start with n disjoint sets, each representing a vertex of G, and then 
proceed to the execution of the algorithm as previously described. If we test 
an edge e = (u, v) against every relevant forest and find that none is valid for 
insertion, we conclude that vertices u and v belong to the same cluster, and 
execute union(u, v) in the extra union-find structure. The performance gain will 
be achieved if we verify whether the endpoints of an edge belong to the same 
cluster before checking all forests in each step. If the vertices do belong to the 
same cluster, we simply discard the edge and proceed to the next one, avoiding 
a fruitless search. 

The worst-case analysis of this approach is as follows: k(n - 1) successful 
searches, O(n) unsuccessful searches (each resulting in a union operation), 
and O(m) edges discarded without any search at all. Knowing that $O(\log m)$ 
operations are required to select an edge and that each search checks up to 
k forests, the overall time bound of this improved version of Heuristic B is 
$O(m \log m + kn(\log m + k\,\alpha(2n, n)))$. 

Discarding Trees. Another idea is to test edges only against a relevant subset of 
the forests, those with more than one component (other forests can be discarded, 
since no edge can be added to them without creating cycles). To implement this, 
we keep the relevant forests (all of them, in the beginning of the algorithm) in 
a linked list. As soon as the (n - 1)-th edge is inserted into a forest $F_i$, we remove 
$F_i$ from the list. This makes the algorithm faster, although its asymptotic time 
bound remains unchanged. 

Ordering. For the running times presented so far, any forest scanning order can 
be used. One may even adopt different orders in each step. However, analyzing 
the forests in the exact same order in every step yields a significant improvement. 
It does not matter which order it is, as long as it is the same in every step. 

To show the improvement we will need to compare connected components 
of the forests $F_i$. We will do so considering only the set of vertices in each 
component (not the edges). Suppose now that the fixed order in which forests 
are considered is $F_1, F_2, \ldots, F_k$. Then, the following is true: 

Theorem 1. Any connected component in a forest $F_i$ is a subset (not necessarily 
proper) of some connected component of $F_{i-1}$, for $1 < i \le k$. 

Proof. If there were a component C in $F_i$ whose vertices were not in the same 
component of $F_{i-1}$, then there would be an edge e joining some two vertices of C 
that could be inserted in $F_{i-1}$. This contradicts the fact that edges are inserted 
in the first valid forest. □ 

The following corollary is immediate. 

Corollary 1. Let e(u, v) be an edge of G and let $F_i$ ($1 \le i \le k$) be the first 
forest in which the insertion of e does not create a cycle (i.e., in which u and v 
are in different connected components). Then, e does not create a cycle in $F_j$, 
for every j such that $i < j \le k$. 

As a consequence of this result, for each pair of vertices u, v, we can divide 
the set of forests into two subsets. In forests $F_1$ to $F_{i-1}$, u and v belong to 
the same connected component; in forests $F_i$ to $F_k$, u and v are in different 
components. Our task is to find the first forest of the second subset. This can be 
done using binary search over the k forests. This reduces the overall complexity 
of the algorithm to $O(m \log m + kn(\log m + \alpha(2n, n) \log k))$. This complexity 
assumes that cluster monitoring (as described above) is used. 
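
The binary search over the forests can be sketched as below. The sketch relies on the monotonicity stated in Corollary 1, which holds only if every edge is always inserted into the first valid forest. The names `DSU`, `first_valid_linear` and `first_valid_binary` are hypothetical, forests are numbered from 0, and the linear scan is included only as a reference for comparison.

```python
class DSU:
    """Union-find over vertices 0..n-1 (path halving, no ranks)."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def separated(forest, u, v):
    """True if u and v lie in different components of the forest."""
    return forest.find(u) != forest.find(v)

def first_valid_linear(forests, u, v):
    """Reference: scan the forests in their fixed order."""
    for f, forest in enumerate(forests):
        if separated(forest, u, v):
            return f
    return None

def first_valid_binary(forests, u, v):
    """Binary search for the first forest in which (u, v) closes no cycle.

    Correct only when forests are filled first-valid-first, so that the
    forests where u and v are connected form a prefix of the list.
    """
    lo, hi = 0, len(forests)
    while lo < hi:
        mid = (lo + hi) // 2
        if separated(forests[mid], u, v):
            hi = mid          # answer is mid or earlier
        else:
            lo = mid + 1      # u, v connected here and in all earlier forests
    return lo if lo < len(forests) else None
```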

Applying the tree-discarding technique presented above is still possible. In 
fact, it becomes easier to implement when the forests are analyzed in a fixed 
order. It follows from Theorem 1 that a forest $F_i$ ($1 < i \le k$) can become a tree 
only after every forest $F_j$ ($1 \le j < i$) has already become a tree in previous 
steps. In other words, the subset of forests which are trees is either empty or 
can be expressed as $\{F_1, F_2, \ldots, F_s\}$, for some $s \le k$. Thus, when a forest $F_i$ 
becomes a tree, all we have to do is set s = i and restrict further searches to 
forests $F_{s+1}, F_{s+2}, \ldots, F_k$. Notice that the single variable s makes the linked list 
for the tree-discarding technique unnecessary. 

Indexed search. Even with all the improvements mentioned so far, the running 
time of Heuristic B has a k log k factor, while Heuristic A is linear in k. As seen 
above, the extra log k factor is due to the cost of searching for a valid forest to 
insert a given edge. We now show that this search can be done in $O(\log n)$ time. 

For this we need the concept of component-equivalence. Two forests $F_i(V, E_i)$ 
and $F_j(V, E_j)$ are component-equivalent if and only if their vertices are 
partitioned into the same connected components (recall that by “same component” 
we mean components with the same set of vertices, but not necessarily the same 
set of edges). With this definition we can now prove the following result. 

Theorem 2. In any step of the algorithm, there are no more than min(n, k) 
sets of component-equivalent forests. 

Proof. This is trivial if $k \le n$, since k is the number of forests. Therefore, let us 
assume that k > n. An immediate consequence of Theorem 1 is that forests are 
always sorted in non-increasing order by number of edges. That is, given forests 
$F_i$ and $F_{i-1}$ ($1 < i \le k$), either $|E_i| < |E_{i-1}|$ or $|E_i| = |E_{i-1}|$. They cannot be 
component-equivalent if the former holds, since the number of edges is different. 
On the other hand, if the second statement is true, Theorem 1 guarantees the 
component-equivalence of $F_{i-1}$ and $F_i$, which means they belong to the same 
set. Thus the number of sets is equal to the number of strict inequalities. Since 
a forest with n vertices may have no more than n - 1 edges, there are forests 
with at most n different numbers of edges in a given step of the algorithm (from 
0 to n - 1). Hence, n is the maximum number of different sets of 
component-equivalent forests if k > n. □ 

As described above, in every step of the algorithm we must look for a valid 
forest $F_i$ for which i is minimum. The implementation discussed in the 
description of the ordering technique finds this forest by performing binary search on 
the list of forests. Using Theorem 2 we show that there is another and faster 
way of finding the same forest. 

Let $S_i$ ($0 \le i \le n - 1$) be the set of component-equivalent forests with i 
edges. For a given edge e, we must find the largest i such that e can be inserted 
into a forest of $S_i$ without creating a cycle. Although e could be inserted into 
any forest in $S_i$, we must choose the first forest in $S_i$ according to the order 
in which forests are being considered. Therefore, we must keep, for each set 
$S_i$, a reference to such forest, making the list of component-equivalent sets act 
as an index to individual forests. To find the desired forest, we can perform a 
binary search on a vector of indices, which has size n, costing us $O(\log n)$. We 
call this index vector cindex. This is the general idea. However, there is a minor 
detail that we must consider. Since any set $S_i$ may be empty in some steps of 
the algorithm, we must decide which forest should be referenced by cindex[i] 
when this happens. We adopted the following solution: make cindex[i] represent 
the first forest with i edges or fewer. Hence, when there is no i-edge forest, we 
have cindex[i] = cindex[i - 1]. With this solution, cindex must be managed as 
follows: 

— Initialization: Set cindex[i] = 1, for $0 \le i \le n - 1$. 

— Search: Given an edge e, find, using binary search, the largest i such that e 
can be inserted in cindex[i] and cannot be inserted in cindex[i + 1]. 

— Update: Let $F_i$ be the forest found in the previous step and $|E_i|$ be the 
number of edges in it (after e is inserted). (Since $|E_i|$ increased by one, we 
have to do something about cindex[$|E_i|$ - 1].) Set cindex[$|E_i|$ - 1] = i + 1, 
which means that $F_{i+1}$ replaces $F_i$ as the first forest with $|E_i|$ - 1 edges or 
fewer. 

Notice that the second phase (search) can be unsuccessful, because there may 
be, in some step of the algorithm, no i such that e can be inserted in $F_i$. When 
this happens, e can be removed from the queue. 
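
The three cindex operations can be sketched as follows. This is an illustrative reading of the description above rather than the authors' code: forests are numbered from 0 instead of 1, the value k in `cindex` is used as a sentinel meaning that no forest has that few edges, and the class and method names (`DSU`, `IndexedForests`, `search`, `insert`) are hypothetical.

```python
class DSU:
    """Union-find over vertices 0..n-1 (path halving, no ranks)."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

class IndexedForests:
    def __init__(self, n, k):
        self.k = k
        self.forests = [DSU(n) for _ in range(k)]
        self.edge_count = [0] * k
        # Initialization: cindex[i] is the first forest with i edges or fewer.
        self.cindex = [0] * n

    def _insertable(self, i, u, v):
        f = self.cindex[i]
        if f >= self.k:       # sentinel: no forest has i edges or fewer
            return True
        return self.forests[f].find(u) != self.forests[f].find(v)

    def search(self, u, v):
        """Binary search over edge counts; returns a forest index or None."""
        if u == v:
            return None
        lo, hi = 0, len(self.cindex) - 1
        while lo < hi:        # largest i for which _insertable(i, u, v) holds
            mid = (lo + hi + 1) // 2
            if self._insertable(mid, u, v):
                lo = mid
            else:
                hi = mid - 1
        f = self.cindex[lo]
        return f if f < self.k else None

    def insert(self, u, v):
        """Search, insert into the forest found, then update cindex."""
        f = self.search(u, v)
        if f is None:
            return None       # unsuccessful search: the edge is discarded
        self.forests[f].union(u, v)
        self.edge_count[f] += 1
        # The next forest becomes the first one with edge_count[f] - 1 edges or fewer.
        self.cindex[self.edge_count[f] - 1] = f + 1
        return f
```

Each search touches $O(\log n)$ entries of `cindex`, independently of k, which is the point of the index.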

We can now present the final analysis of Heuristic B. The only difference 
we have introduced is the number of comparisons made when searching for a 
forest to insert an edge. When cindex is used, $O(\log n)$ comparisons are 
required per search, as opposed to $O(\log k)$ without that auxiliary vector. In both 
cases, all comparisons related to a certain forest take at most $O(n\,\alpha(2n, n))$ total 
time. Therefore, Heuristic B with cluster control and indexed search will run in 
$O(m \log m + kn(\log m + \alpha(2n, n) \log n))$ time. 

4 An Experimental Study 

We have implemented and tested the exact algorithm, Heuristic A with Kruskal’s
algorithm, Heuristic B, and a random algorithm (see next paragraph). Heuristic
A with Prim’s algorithm and a binary heap was implemented, but preliminary
testing showed that it was considerably slower than A-Kruskal, and hence it was
not part of further testing.

The random algorithm builds k trees, and for each tree it selects (n - 1) edges
randomly, using a union-find data structure to avoid cycle-creating edges. This 
algorithm was implemented so that the proposed heuristics could be judged not 
only on how close they get to the optimal solution but also how far they are 
from a randomly found solution. 
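
A minimal sketch of such a random baseline is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the `(u, v, w)` edge representation, and the use of a shuffled edge list as the source of randomness are all choices made here.

```python
import random


def random_spanning_tree(n, edges, rng):
    """Pick edges in random order, skipping those that would close a cycle."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    order = list(edges)
    rng.shuffle(order)
    tree = []
    for u, v, w in order:
        ru, rv = find(u), find(v)
        if ru != rv:                 # union-find rejects cycle-creating edges
            parent[ru] = rv
            tree.append((u, v, w))
            if len(tree) == n - 1:
                break
    return tree


def random_solution(n, edges, k, seed=0):
    """Build k independent random spanning trees of a connected graph."""
    rng = random.Random(seed)
    return [random_spanning_tree(n, edges, rng) for _ in range(k)]
```

For a connected input graph every tree returned has exactly n - 1 edges.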

The implementation of the exact algorithm contains the following practical 
improvement in algorithm RT motivated by our experience in developing Heuris- 
tic B. When trying to insert an edge into a forest, we try all forests (in the same 
fixed order) rather than invoke the labeling procedure after failing to insert the 
edge into forest Fi. 

All programs were implemented in C++ and compiled with the GNU C++
compiler using the -O3 optimization flag. All tests were done on a DEC Alpha
600au with 1 GB of RAM. Every instance fit in memory, thus limiting I/O
operations to reading the input graph. The penalty function used was w_p(e, i) =
i w(e). This results in a quadratic objective function.
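
To see why this penalty gives a quadratic objective, the sketch below computes the cost of a set of trees under the assumption that the i-th use of an edge e costs i w(e), so an edge used c times contributes w(e) c(c + 1)/2 in total. The function name and the `(u, v, w)` edge representation are illustrative assumptions.

```python
from collections import Counter


def penalized_cost(trees):
    """Total cost of a list of trees when the i-th use of edge e costs i * w(e)."""
    uses = Counter()
    for tree in trees:
        for u, v, w in tree:
            uses[(min(u, v), max(u, v), w)] += 1
    # 1*w + 2*w + ... + c*w = w * c * (c + 1) / 2, quadratic in the usage count c
    return sum(w * (c * (c + 1) // 2) for (_, _, w), c in uses.items())
```

For example, using the same two edges of weights 3 and 5 in two trees costs (3 + 5) + 2(3 + 5) = 24, whereas a single tree costs 8.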

The implementations were mostly tested on program-generated families of
instances. The generators were written by the authors. The families are as follows:

— complu, a complete graph, with uniform distribution of distinct weights.

— hyperb, a hypercubic lattice with a biased distribution of distinct weights;
the bias favors small weights; these graphs are sparse (m = 4n).

— random, a random connected graph with uniform distribution of weights.

In reporting results we use the concept of solution quality. The quality of 
solution S given by algorithm X is simply the ratio between the cost of S and 
the optimal cost. 

4.1 Results 

Table 2 compares the three implementations with respect to running time, and
the two heuristics and the random algorithm with respect to solution quality, for
family complu and n = 100 with varying k. Table 3 is similar, but the instances
come from family hyperb, k = 100 and n varies. In both cases, running times
in each row are averages over three different instances (different seeds); and the
qualities reported are maxima (i.e. worst), except for the random algorithm. The
random algorithm was run three times on each instance, and the value shown in
each row is the best solution found in the nine runs thus performed.

Both tables make it clear how much faster the heuristics are with respect 
to the exact algorithm. In all these cases heuristic solution quality is within 
0.05% of optimal. The random algorithm is significantly worse than the heuris- 
tics. The difference in the quality values obtained by the random algorithm in 
families complu and hyperb shows that this algorithm is sensitive to the weight 
distribution, as expected. 

To get an idea of how quality changes with different values of k we ran
Heuristic A on a graph from TSPLIB (called brazil58). This is a complete graph
with n = 58. The results are shown in the accompanying figure. Results for
Heuristic B are essentially the same. We see that the worst case occurs for k = 5,
and that for large values of k quality tends to 1. For this instance and k = 1000,
Heuristic A took 0.52 secs and Heuristic B took 0.21 secs, while the exact
algorithm took 22 secs. Similar quality behavior was observed for other inputs.
The random algorithm in this experiment had the following behavior: for k = 1,
its quality was 6.99; for larger values of k, the quality improved, reaching 2.02
for k = 1000.

The last experiment compares the performance of heuristics A and B with
respect to running time, on family random, n = 500, k = 1000. We varied the
number of edges from m = 1200 (very sparse) to m = 124750 (complete), on a
total of 10 m values (twelve instances for each value of m). Results are shown
in the accompanying figure.

We can see there that the km factor in the theoretical running time of Heuris- 
tic A does have a considerable influence in its performance in practice. We note 
that running time variance in this experiment was large, which may explain the 
odd behavior of B’s running time towards the right end of the graph. 

Table 2. Results for family complu, n = 100

          time (secs)                  quality
   k    exact     A      B        A          B       random
 100     2.32   0.08   0.04   1.000492   1.000496     4.92
 200     6.97   0.13   0.08   1.000150   1.000149     4.51
 300    12.37   0.20   0.12   1.000297   1.000302     4.29
 400    17.67   0.28   0.16   1.000198   1.000197     4.13
 500    21.99   0.36   0.20   1.000130   1.000130     4.06


Table 3. Results for family hyperb, k = 100

          time (secs)                  quality
   n    exact     A      B        A          B       random
  81     0.81   0.01   0.02   1.000461   1.000403    52.29
 256    10.75   0.05   0.10   1.000254   1.000258    54.48
 625    76.93   0.17   0.32   1.000310   1.000310    29.14
1296   364.21   0.41   0.85   1.000325   1.000325    15.69


5 Final Remarks 

The parallel edge technique is a standard way of dealing with certain nonlinear
optimization problems, such as the kMSTc problem, studied here. We have
shown that greedy heuristics, well implemented, represent an interesting alter- 
native to solve this problem, if optimality is not crucial but running time is. In 
addition these heuristics obtain results that are far better than random solutions. 

We have also investigated the possibility of adding a lower bound computation
to our heuristics. A simple lower bound for the kMSTc problem is a set of
k(n - 1) edges (not necessarily distinct) of G such that the sum of their penalized
weights is minimum and such that no more than k copies of any given edge are
selected. The computation of this lower bound can be done as follows. Initially,
we must insert every edge e ∈ E into a priority queue and set counter_e = 0.
In each step of the algorithm, we remove the first edge (e) from the queue and
increment counter_e. If the new value of counter_e is k, e can be discarded; if
counter_e < k, we must update the incremental cost of e and reinsert it into the
queue. We stop after k(n - 1) steps. If the priority queue is implemented as a
binary heap, each step will require O(log m) time. This yields a running time of
O(kn log m), asymptotically better than any of the algorithms discussed so far.
This lower bound is so simple that its computation can be incorporated into
any of the heuristics, thereby creating a program that will output a quality 
measure of the solution obtained, while still being much faster than the exact 
algorithm. However, preliminary experiments have shown that this lower bound 
is not consistently strong for this purpose (there are cases in which it attains 
only 50% of the optimum value) . We are currently trying to improve this bound 
and at the same time keep it fast to compute. 
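
The priority-queue computation of the lower bound described above can be sketched as follows. This is an illustration only: the function name and the `(u, v, w)` edge representation are assumptions, and the incremental cost of the i-th copy of an edge is taken to be i w(e), matching the penalty function used in the experiments.

```python
import heapq


def lower_bound(n, edges, k):
    """Sum of the k(n-1) cheapest edge copies, at most k copies per edge.

    Assumes the graph has at least n - 1 edges, so the heap never runs empty.
    """
    # Heap entries are (incremental cost of the next copy, edge index).
    heap = [(w, idx) for idx, (_, _, w) in enumerate(edges)]
    heapq.heapify(heap)
    counter = [0] * len(edges)
    total = 0
    for _ in range(k * (n - 1)):
        cost, idx = heapq.heappop(heap)
        total += cost
        counter[idx] += 1
        if counter[idx] < k:
            # Reinsert with the cost of the next copy: (count + 1) * w(e).
            heapq.heappush(heap, ((counter[idx] + 1) * edges[idx][2], idx))
    return total
```

On a triangle with weights 1, 2 and 3 and k = 2, the bound is 1 + 2 + 2 + 3 = 8, which in this tiny case coincides with the optimum; in general the selected copies need not form spanning trees, which is why the bound can be weak.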

Several other lines of future research suggest themselves: is there a faster
exact algorithm for kMSTc? Are the heuristics in fact approximation algorithms?
What is the behavior of the heuristics for other penalty functions? In what other 
congestion problems can these greedy techniques be applied? 

Abstract. The Transversal Hypergraph Problem is the problem of computing,
given a hypergraph, the set of its minimal transversals, i.e. the
hypergraph whose hyperedges are all minimal hitting sets of the given
one. This problem turns out to be central in various fields of Computer
Science. We present and experimentally evaluate a heuristic algorithm 
for the problem, which seems able to handle large instances and also 
possesses some nice features especially desirable in problems with large 
output such as the Transversal Hypergraph Problem. 

1 Introduction 

Hypergraph theory is one of the most important areas of discrete mathematics
with significant applications in many fields of Computer Science. A hypergraph
H is a generalized graph defined on a finite set V of nodes, with every hyperedge
E of H being a subset of V. A hypergraph is a convenient mathematical
structure for modeling numerous problems in both theoretical and applied Computer
Science and discrete mathematics. One of the most intriguing problems on
hypergraphs is the problem of computing the transversal hypergraph of H, denoted
tr(H). The transversal hypergraph is the family of all minimal hitting sets
(transversals) of H, that is, all sets of nodes T such that (a) T intersects all
hyperedges of H and (b) no proper subset of T does. TRANSVERSAL HYPERGRAPH
is the problem of generating tr(H) given a hypergraph H, and is an important
common subproblem in many practical applications. Its importance arises from
the fact that problems referring to notions like minimality or maximality are
quite common in various areas of Computer Science.

For example, the concepts of propositional circumscription and minimal
diagnosis restrict interest to the set of models of an expression that are minimal.
In circumscription, model checking for circumscriptive expressions reduces
to determining the minimality of a model, whereas in model-based system diagnosis,
finding a minimal diagnosis is equivalent to finding the prime implicants of
an expression and, next, finding the minimal cover of them. Moreover, finding a
maximal model is an essential task in model-preference default inference since,
given a set of default rules, our aim is to find a maximal (most preferred) model
of these rules. It can also be seen that the problem of computing the transversal
hypergraph is an alternative view of the problem of generating all maximal
models of a Boolean expression in CNF, having all its variables negated. Such
an expression can be seen as a hypergraph whose hyperedges are the clauses of
the expression and each node of a hyperedge corresponds to a negative variable
of the corresponding clause. The maximal models of the expression then correspond
to the transversals of the hypergraph. A symmetric situation holds for
the minimal models of an expression having all its variables positive. Complexity
questions related to minimal or maximal models have been discussed in the
literature, which also contains more recent results.

* Research supported by the University of Patras Research Committee (Project
Caratheodory under contract no. 1939).

An encyclopedic exposition of the applications of the TRANSVERSAL HYPERGRAPH
problem can be found in the literature. We briefly state from there certain
problems in the design of relational databases, distributed databases,
and in model-based diagnosis. Another interesting connection was pointed
out between the TRANSVERSAL HYPERGRAPH problem and the rapidly growing
field of knowledge discovery in databases, or data mining.



In this paper we present and experimentally evaluate a heuristic algorithm 
for solving the transversal hypergraph problem. The algorithm was im- 
plemented and tested on a number of randomly generated problem instances. 
The experimental results show that the algorithm computes all transversals of a 
given hypergraph correctly and efficiently. This fact makes our heuristic suitable 
for solving problems in all areas mentioned above. 



One has to be careful in defining efficiency of algorithms for problems like
the TRANSVERSAL HYPERGRAPH. It is not hard to see that the transversal
hypergraph tr(H) of a hypergraph H may have exponentially many hyperedges
with respect to the number of nodes and the number of hyperedges of H. It
will therefore require an exponential amount of time to compute tr(H) in the worst
case. Therefore, the usual distinction between tractable and intractable problems,
based on the existence or not of a polynomial-time algorithm, clearly does
not apply here. Instead, more elaborate complexity measures have to be defined
that take into account the size of the output, too. It is natural to consider as
tractable a problem with large output if it can be solved by an algorithm that is
polynomial in both the input and the output. Such algorithms are called
output-polynomial. A slightly stronger requirement is that the algorithm generates a
new output bit in time polynomial in the input and the output so far. These
latter algorithms are called incrementally output-polynomial. An even stronger
requirement is that the algorithm generates two consecutive output bits in time
bounded by a polynomial in the input size. These are called polynomial delay
algorithms. There has been a recent surge of interest in such algorithms.
Algorithms with output and their performance criteria are discussed in the
literature.



The precise complexity of the TRANSVERSAL HYPERGRAPH problem is still
unknown. The brute force algorithm given by Berge needs time exponential
in both the input and the output. However, several special cases can be solved
in polynomial time. Recently, an output-subexponential algorithm was given
by Fredman and Khachiyan. There, it was shown that the duality of two
monotone Boolean expressions in DNF can be checked in time O(n^{log n}), where
n is the combined size of the input and the output. It is not hard to see that this
problem is another disguised form of the TRANSVERSAL HYPERGRAPH problem
(further details on these issues can be found in the literature).

Our algorithm presents in practice a remarkable uniformity in its output rate;
averaging over relatively small parts of the output (e.g. 100 transversals out of a
total output of tens of thousands), we get delays deviating by at most three times
the mean value. (We mention, however, that at present we can prove no bound for
the delay between consecutive outputs.) This happens partially because our algorithm
operates in a generate-and-forget fashion, i.e., no previous transversal is required
for the generation of the next ones. In contrast, both the brute force algorithm
and the Fredman-Khachiyan algorithm require all previous transversals to be 
stored. Moreover, the former will output the first transversal after exponentially 
long delay. Our approach also greatly reduces the memory requirements, since 
previously generated transversals need not be stored. In a different situation, the 
memory requirements could be devastating as the total number of transversals 
can be enormous. In addition, absolute time delays are very small, allowing the 
successful handling of large problems. 

The rest of this paper is organized as follows: In Section 2 we describe our 
heuristic algorithm for generating all transversals of a hypergraph. In the next 
section we compare our algorithm with some other approaches while in Section 4 
we give some implementation details. Experimental results on the performance of 
the algorithm are summarized in Section 5. Finally, in Section 6 some conclusions 
are given. 



2 Description of the Algorithm 

The proposed algorithm is based on the following simple algorithm of Berge:
Consider a hypergraph H = {E_1, ..., E_m}. Assume that we have
already computed the transversal hypergraph tr(G) of G = {E_1, ..., E_k}, for
some k < m. It is easy to see that tr(G ∪ {E_{k+1}}) = {min{t ∪ {v}} : t ∈
tr(G) and v ∈ E_{k+1}}, where by min{t ∪ {v}} we denote the set of minimal subsets
of t ∪ {v} that are hitting sets of G ∪ {E_{k+1}}. Based on this observation we
may find all transversals of H by starting from the transversals of E_1 (note that
the transversals of a hypergraph with a single hyperedge are all its nodes) and
adding one-by-one the rest of the hyperedges, computing at each step the set of
transversals of the new hypergraph. The algorithm terminates after the addition
of E_m.
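
The scheme of Berge described above can be sketched in a few lines. This is an illustrative implementation, not the code used in the experiments; the function names are chosen here. It uses the standard shortcut that a transversal already hitting the new hyperedge is kept unchanged, since any extension of it would be a superset and would be removed by the minimization step.

```python
def minimal_sets(family):
    """Keep only the inclusion-minimal sets of a family of frozensets."""
    family = set(family)
    return {s for s in family if not any(t < s for t in family)}


def berge_transversals(hyperedges):
    """All minimal transversals, adding the hyperedges one at a time."""
    edges = [frozenset(e) for e in hyperedges]
    if not edges:
        return {frozenset()}
    # The transversals of a single hyperedge are its individual nodes.
    transversals = {frozenset([v]) for v in edges[0]}
    for edge in edges[1:]:
        candidates = set()
        for t in transversals:
            if t & edge:
                candidates.add(t)                 # t already hits the new edge
            else:
                candidates.update(t | {v} for v in edge)
        transversals = minimal_sets(candidates)   # all levels are kept in memory
    return transversals
```

For the hypergraph {{1, 2, 3}, {3, 4, 5}, {1, 5}, {2, 5}} this yields the five transversals {1, 5}, {2, 5}, {3, 5}, {1, 2, 3} and {1, 2, 4}. Note that nothing is output until the last hyperedge has been processed.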

There are several drawbacks in the above scheme regarding its efficiency
which are explained in detail in the next section. We only mention here the most
severe one, in view of the complexity measures for this problem: The computation
of the first final transversal (a transversal of the input hypergraph H) is
accomplished after all transversals of the hypergraph {E_1, ..., E_{m-1}} have been
computed. This means that it may take an exponentially long time before the first
transversal is output. No less important are the memory requirements that also
emerge from the above: all intermediate transversals have to be stored and kept
until used for the computation of the new transversal set.

We now explain our algorithm. Consider a hypergraph H = {E_1, ..., E_m}
defined on a set V of n nodes. We call a set of nodes X ⊆ V a generalized
node if all nodes in X belong to exactly the same hyperedges of H. Assume that
the hypergraph H has a generalized node X with cardinality |X| ≥ 2. Consider
the hypergraph H' which follows from H by replacing all nodes in X, in all
hyperedges in which they appear, by a new node v_X. Let now tr(H') be the set of
transversals of H'. The importance of the concept of generalized node follows
from the observation that tr(H) = (t \ {v_X}) × X, for all t ∈ tr(H') such that
v_X ∈ t. In other words, the transversals of H follow by taking one by one the
transversals of H' that include the node v_X and replacing v_X by each node in X
in turn.

It is obvious that the number of transversals of H produced from a single
transversal of H' is |X|. The transversals of H' that do not include v_X remain
as they are, since they hit H. Going one step further, if a hypergraph H has two
generalized nodes, X_1 and X_2, we can compute the transversals of the hypergraph
H' that follows from H by removing the nodes of X_1 and X_2 and replacing them
with two new nodes, v_{X_1} and v_{X_2}, respectively. We compute now the transversals
of H by taking the transversals of H' and substituting the generalized nodes v_{X_1}
and v_{X_2} (where they appear) by all possible combinations of (simple) nodes in X_1
and X_2, respectively. Clearly, this procedure can be generalized to any number
of generalized nodes.
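
The two operations involved, grouping nodes into generalized nodes and expanding a generalized transversal into ordinary ones, can be sketched as follows. The function names and the representation (a generalized transversal as a list of node groups) are assumptions made for illustration.

```python
from itertools import product


def generalized_nodes(hyperedges):
    """Group nodes that belong to exactly the same hyperedges."""
    membership = {}
    for idx, edge in enumerate(hyperedges):
        for v in edge:
            membership.setdefault(v, set()).add(idx)
    groups = {}
    for v, signature in membership.items():
        groups.setdefault(frozenset(signature), set()).add(v)
    return [frozenset(g) for g in groups.values()]


def expand(generalized_transversal):
    """Replace each generalized node by each of its simple nodes in turn."""
    groups = [sorted(g) for g in generalized_transversal]
    return [frozenset(choice) for choice in product(*groups)]
```

A generalized transversal made of groups of sizes s_1, ..., s_r stands for s_1 × ... × s_r ordinary transversals, so only one object needs to be stored for all of them.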

We exploit the concept of the generalized node in such a way that, during
all intermediate steps, we keep only the generalized transversals which, in turn, 
are split after the addition of the new hyperedge. This dramatically reduces the 
number of intermediate transversals, especially at the early stages (where the 
generalized nodes are few but large) and greatly improves the time performance 
and the memory requirements. 

Example 1. Assume that the first two hyperedges have 100 nodes each: E_1 =
{v_1, ..., v_100} and E_2 = {v_51, ..., v_150}. The partial hypergraph {E_1, E_2} has
2550 transversals (2500 with two nodes and 50 with one) which must be kept
for the subsequent stage if we use the straightforward scheme. Using the generalized
node approach, we have only 2 transversals to store, namely the set
{{v_51, ..., v_100}} and the set {{v_1, ..., v_50}, {v_101, ..., v_150}}.

The second improvement to the simple scheme of Berge was especially designed
having in mind the rate of output of the algorithm. Recall that the simple scheme
must make a lot of computation before outputting the first transversal, after
which all the rest follow almost with zero delay one from the other. This occurs
because the simple scheme is based on a sort of breadth-first computation of
the transversals (all transversals are computed after a new hyperedge is added).
Instead, our implementation computes the transversals in a depth-first manner:
At a certain level, we compute a transversal of the partial hypergraph and then
add to it the next hyperedge. From this transversal several others follow, as we
described above. However, instead of computing them all, we pick one, add the
next hyperedge and continue until all hyperedges have been added; in this case
we output the final transversal. We then backtrack to the previous level, pick
the next transversal, etc. The whole scheme resembles a preorder visit of a tree
of transversals with root the single (generalized) transversal of the first hyperedge,
and internal nodes at some level, the generalized transversals of the partial
hypergraph at that level. The descendants of a transversal are the transversals
of the next hypergraph which include this transversal. Finally, the leaves of the
tree at level m are the transversals of the original hypergraph.

Fig. 1. Transversal tree of the hypergraph H = {{1, 2, 3}, {3, 4, 5}, {1, 5}, {2, 5}}.
The tree is visited in preorder.

Example 2. Consider the hypergraph with 5 nodes and 4 hyperedges H = {{1, 2,
3}, {3, 4, 5}, {1, 5}, {2, 5}}. The tree of transversals which corresponds to the
addition of the hyperedges according to the given order (top to bottom) is shown
in Fig. 1. Generalized nodes are denoted by circles with thin lines. For instance,
a partial transversal of the hypergraph consisting of the first two hyperedges is
{{1, 2}, {4, 5}}.

The efficiency of the above is further improved by a selective way of producing
new transversals at an intermediate level. This idea results in ruling out
regeneration of transversals at the cost that some intermediate nodes may have
no descendants. The advantage of this approach is that search in some subtrees
stops at higher levels instead of exhaustively generating everything that
would subsequently need to be compared to previous transversals and, possibly,
discarded. The method is described and its correctness proved elsewhere.
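
A depth-first, generate-and-forget traversal of the transversal tree can be sketched as below. This is only an illustration under stated assumptions: it works on simple nodes rather than generalized nodes, and since the exact selection rule of the paper is not given here, it substitutes a standard test (a candidate is kept only if every node in it is the sole representative of some hyperedge seen so far). The function name is hypothetical.

```python
def dfs_transversals(hyperedges):
    """Yield the minimal transversals one at a time, in preorder."""
    edges = [frozenset(e) for e in hyperedges]

    def is_minimal(t, level):
        # Every node of t must have a private hyperedge among edges[:level].
        return all(any(e & t == {v} for e in edges[:level]) for v in t)

    def visit(t, level):
        if level == len(edges):
            yield t                        # leaf: a transversal of the input
            return
        edge = edges[level]
        if t & edge:
            yield from visit(t, level + 1)          # t already hits the edge
        else:
            for v in sorted(edge):
                child = t | {v}
                if is_minimal(child, level + 1):    # otherwise prune the subtree
                    yield from visit(child, level + 1)

    yield from visit(frozenset(), 0)
```

Only the current root-to-leaf path is held in memory, and the first transversal is produced after at most one descent through the levels unless dead ends are met. On the hypergraph of Example 2 the sketch visits one dead end (the partial transversal {2, 4}) and outputs each of the five transversals exactly once.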

3 Comparison to Other Approaches 

We already mentioned that the algorithm of Berge outputs its first transversal
near the end of the total computation. Thus, it may take exponentially long
for the first transversal to be output. In contrast our algorithm, as the experiments
show, delivers its output in quite a uniform way from the beginning to the end of
the computation. From the implementation point of view, the algorithm of Berge
requires all intermediate transversals to be stored and kept until used for the
computation of the new transversal set. Therefore the memory requirements are
proportional to the size of the transversal tree. Since the number of transversals
can be exponential, we conclude that the memory requirements can become
devastating. Our algorithm, instead, operates in a generate-and-forget fashion
by visiting the transversal tree in preorder, and so its memory requirements are
proportional to the depth of the transversal tree rather than to its size. There are
also other points of improvement in our algorithm compared to the algorithm of
Berge; one has to do with the concept of generalized nodes. In addition to the
compact form of storing intermediate transversals, our algorithm also excludes
the possibility of generating a transversal more than once, which is a possibility
in Berge’s algorithm. The additional copies of a transversal must be identified
and removed, otherwise they will result in a blow-up of the total number of
partial transversals at the next steps. Additionally, there is an unnecessarily
large number of intermediate transversals (especially in problem instances with
many nodes) that have to participate in the computation of the transversal set
of the new hypergraph. This is something that further adds to the blow-up of
the transversals of the new hypergraph. In contrast, by using the concept of
generalized nodes, our algorithm greatly reduces the number of intermediate
transversals, as illustrated in Example 1. Regarding the total running time of
Berge’s algorithm, as an example we mention that we ran Berge’s algorithm for
hypergraphs with 30 nodes and 30 hyperedges. The running time was more than
60 seconds while, as the experimental results of Section 5 illustrate, our
algorithm requires only 1 second for hypergraphs of this size.

The algorithm of Fredman and Khachiyan has the best provable bound on its
time performance. The algorithm actually solves the decision problem, namely
the problem of deciding, given hypergraphs H and G, whether tr(H) = G and,
if not, it returns a minimal hitting set of one of the two hypergraphs that is not
a hyperedge of the other. The algorithm runs in time O(n^{log n}) where n is the
combined size of H and G. By using repeatedly this algorithm as a subroutine, the
generation problem can be solved in incremental time O(n^{log n}). We are not
aware of any implementation of the Fredman-Khachiyan algorithm. However, as
in the previous case, both the input and the output so far have to be stored and
consequently the same problem regarding the memory requirements exists here
as well. Regarding its time performance, this algorithm operates in each decision
step on both the input and the output so far and thus, the delay increases after
each output step. In contrast, our algorithm seems to remain relatively stable
in this regard. In any case, a careful implementation of the Fredman-Khachiyan
algorithm would be of great interest.

Finally, for the sake of completeness, we mention the possibility of computing
the transversal hypergraph by generating all possible hitting sets of the hypergraph
(which are of the order of 2^n) and subsequently checking each if it is a
minimal hitting set. Naturally, this simple algorithm overcomes the problem of
storing all transversals, but its time performance is unacceptable (it can be used
for hypergraphs having no more than 15 nodes). Yet, we have implemented it,
since it is a reliable test for verifying the correctness of our algorithm.
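
Such an exhaustive reference implementation is short enough to sketch here. The function name is an assumption; the check uses the fact that a hitting set is minimal exactly when removing any single node destroys the hitting property.

```python
from itertools import combinations


def brute_force_transversals(hyperedges):
    """Enumerate all node subsets and keep the minimal hitting sets."""
    edges = [frozenset(e) for e in hyperedges]
    nodes = sorted(set().union(*edges))

    def hits(s):
        return all(s & e for e in edges)

    result = []
    for size in range(len(nodes) + 1):
        for combo in combinations(nodes, size):
            s = frozenset(combo)
            # Hitting is monotone, so single-node removals suffice for minimality.
            if hits(s) and not any(hits(s - {v}) for v in s):
                result.append(s)
    return result
```

The loop visits all 2^n subsets, which is why this approach is only practical as a correctness oracle on small instances.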

4 Implementation Issues 

The main body of the program consists basically of two procedures. (We only
mention here the more important formal parameters.)

    procedure add_next_hyperedge(t, h) {
        Update the generalized nodes;
        while generate_next_transversal(t, t') {
            if h is last, then output t'
            else {
                Let h' be the next hyperedge;
                add_next_hyperedge(t', h');
            };
        };
    };

and the boolean function generate_next_transversal(t, t').

The first adds to the partial transversal t the next hyperedge h and repeatedly
calls the second one, which returns the next partial transversal t' of the new
hypergraph. generate_next_transversal is called until no more transversals
follow from t after the addition of h, in which case generate_next_transversal
returns false. After a new transversal t' is returned, add_next_hyperedge is
called recursively for t' and the next hyperedge. The recursive implementation
was chosen as a fast solution to the task of developing a fairly complex code. Since
add_next_hyperedge is called once for each hyperedge, the depth of recursive
calls equals the number of hyperedges. Hence, for some problem instances this
can be quite large, resulting in poor memory usage and even in a code not as fast
as it could be. We plan, however, to remove recursion from future versions of the
program, thus solving the above problems. This is why we have chosen to use
the number of recursive calls between consecutive generations, and not absolute
time, as the primary measure of performance. Note, however, that, as shown in
the next section, the absolute time performance is quite satisfactory even
for large problems. We stress that this can be improved further.

The performance of the program is very much affected by the data types 
used for storing generalized nodes. This is because the use of generalized nodes 
resulted in a code that does intensive set manipulation operations. These sets 
represent sets of nodes (actually integers from 1 to the number of nodes of the 
hypergraph). Several different methods were tried in different versions of the 
code. The fastest implementation represents a set as a bit vector that spans as
many computer words as required depending on its size (our machine has a 32-bit
word). This choice may slightly limit the portability of the code but it is by far
the fastest. Set operations (union, intersection, complementation, etc.) are then 
accomplished as low level bit operations (or, and, etc.). 
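
The idea can be illustrated with a small sketch. It is not the representation used in the authors' code: Python integers have arbitrary precision, so the single integer below stands in for the array of 32-bit words, and the helper names are chosen for illustration.

```python
def to_bits(nodes):
    """Encode a set of nodes (integers starting at 1) as a bit vector."""
    bits = 0
    for v in nodes:
        bits |= 1 << (v - 1)
    return bits


def from_bits(bits):
    """Decode a bit vector back into a set of nodes."""
    nodes, v = set(), 1
    while bits:
        if bits & 1:
            nodes.add(v)
        bits >>= 1
        v += 1
    return nodes


def complement(bits, n):
    """Complement with respect to the node set {1, ..., n}."""
    return ~bits & ((1 << n) - 1)
```

With this encoding, union is `a | b`, intersection is `a & b`, and the test "does t hit hyperedge e" is simply `t & e != 0`.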

5 Experimental Results

In this section we present the experimental evaluation of the performance of 
the algorithm on a set of test cases. Experiments were carried out on a Sun 
Enterprise 450, using GNU C++ 2.8.1 with compiler setting -O3.

Since real data were not available, the program was evaluated using a num- 
ber of randomly generated hypergraphs. A random hypergraph generator was 
implemented and used for this task. Given the number of nodes, n, and the 
desired number of hyperedges, m, the random hypergraph generator uniformly 
and independently generates m sets of nodes, each of them corresponding to a 
hyperedge of the instance. The cardinality of each set lies between 1 and n — 1, 
and is randomly chosen, too. The resulting hypergraph is simple, that is, there
is no hyperedge that appears twice or is fully included in another one. Reports 
are averages over 50 different runs for each instance size, using a different initial 
seed for the random hypergraph generator in every run. 
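
A generator in this spirit might look as follows. The text does not say how simplicity is enforced, so the rejection step below (discard a candidate that equals, contains, or is contained in an existing hyperedge) is an assumption, as are the function name, the `max_attempts` safeguard, and the requirement n ≥ 2.

```python
import random


def random_simple_hypergraph(n, m, seed=0, max_attempts=100000):
    """Draw m hyperedges over nodes 1..n, none containing another."""
    rng = random.Random(seed)
    nodes = range(1, n + 1)
    edges = []
    attempts = 0
    while len(edges) < m and attempts < max_attempts:
        attempts += 1
        size = rng.randint(1, n - 1)          # cardinality between 1 and n - 1
        candidate = frozenset(rng.sample(nodes, size))
        # Reject duplicates and any containment, keeping the hypergraph simple.
        if any(candidate <= e or e <= candidate for e in edges):
            continue
        edges.append(candidate)
    if len(edges) < m:
        raise ValueError("could not build a simple hypergraph of this size")
    return edges
```

Rejection changes the distribution of the accepted hyperedges somewhat, so results obtained with this sketch need not match those reported here.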

The first few experiments aimed at verifying the correctness of the code. This 
was done by implementing the simple algorithm that computes all transversals of 
a particular hypergraph using exhaustive search and then checking for minimal- 
ity. Since this simple algorithm needs exponentially many steps to compute all 
minimal transversals, it could only be used for small instances (hypergraphs with 
at most 15-16 nodes). The resulting transversal hypergraph was then compared 
to the output of the program. 

Testing the correctness of the code for larger instances was accomplished by
applying the duality theorem of the transversal hypergraph: tr(tr(H)) = H (the
literature covers more theoretical issues on hypergraphs). We ran our algorithm
on its own output for a particular instance and verified that the new output was
the original input. Notice that the above theorem also offers the possibility of generating
non-random problem instances with specific properties of the output that would
otherwise be impossible to generate by any conventional random instance generator;
we were therefore able to test our algorithm on problems with a very large
number of hyperedges but very few transversals.
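
The duality test can be sketched as a small self-check. The helper `tr` below is a compact incremental computation of the minimal transversals, written only for illustration and not the algorithm evaluated in this paper; the identity holds for simple hypergraphs, which is what the generator produces.

```python
def tr(hyperedges):
    """Minimal transversals, computed by adding hyperedges one at a time."""
    edges = [frozenset(e) for e in hyperedges]
    result = {frozenset()}
    for edge in edges:
        candidates = set()
        for t in result:
            if t & edge:
                candidates.add(t)
            else:
                candidates.update(t | {v} for v in edge)
        # Keep only the inclusion-minimal candidates.
        result = {s for s in candidates if not any(o < s for o in candidates)}
    return result


def passes_duality_check(hyperedges):
    """Check that applying tr twice gives back the (simple) input hypergraph."""
    original = {frozenset(e) for e in hyperedges}
    return tr(tr(original)) == original
```

The same function also turns a hypergraph with few hyperedges and many transversals into its dual, an instance with many hyperedges and few, known, transversals.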

The results of the experiments are summarized in the tables that follow.
Problem sizes are identified by the number of nodes, n, and the number of
hyperedges, m, of the hypergraph. In the first table, the first three columns (after
column m) give the size of the output (the minimum, the maximum, and the average),
in thousands of transversals, over the 50 experiments that were conducted for
each pair n and m. It is important to notice that these numbers also characterize
the problem size. Larger problem instances with respect to n and m resulted in
a number of transversals too large to be generated within some moderate time.
We next report the performance of the algorithm in terms of its rate of output.
Specifically, the next three columns give the delay time (in number of calls of
procedure add_next_hyperedge) between consecutive outputs. The first one reports
the maximum delay that was observed in the whole process of generating
the transversals. This number characterizes the worst-case performance of the
algorithm. We believe that the unexpectedly good behavior of this parameter
deserves further theoretical investigation. The next column presents the average
delay while the last one reports the total running time for generating all
transversals. As before, every entry in these columns is the average (over the 50
runs) of the corresponding parameters. The last column of the table reports the
total CPU time (in seconds) required, on the average, for each run.

The experiments summarized in Table 2 aimed at a closer study of the rate of
the output. To this end, the standard deviation σ of the delays was calculated.
Each row in Table 2 corresponds to a single problem (as opposed to the results
from 50 runs that were reported in Table 1), since averaging standard deviations
is absurd. From the 50 different experiments with the same n and m, however, we
have chosen to report the one with the worst behavior with respect to σ. Even
in this case, the values of σ remain relatively small (at most 3 times the
average delay). The size of the output (transversals) as well as the total time
(number of calls of procedure add_next_hyperedge for generating all
transversals) and the CPU time required are also reported.
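
As an illustration of how such delay statistics can be derived, the sketch below assumes that the program records the value of a call counter each time a transversal is output. The list `output_times`, the function name, and the choice of the population standard deviation are assumptions of this example; the paper does not state which estimator was used.

```python
from statistics import mean, pstdev


def delay_statistics(output_times):
    """output_times[k] is the call-counter value when output k was produced."""
    previous = [0] + output_times[:-1]
    delays = [b - a for a, b in zip(previous, output_times)]
    return {
        "max": max(delays),          # worst-case delay
        "average": mean(delays),     # average delay
        "sigma": pstdev(delays),     # standard deviation of the delays
        "total": output_times[-1],   # total number of calls
    }


stats = delay_statistics([2, 4, 9, 10])  # delays are 2, 2, 5, 1
```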

In the problems reported in Tables 1 and 2, all randomly generated hypergraphs
have a small number of hyperedges and a very large number of transversals.
Problems of this kind have relatively small delays, since the output is very
large. The duality property of the transversal hypergraph mentioned above
provides a way of testing the algorithm on non-random instances with a very
large number of hyperedges and a small (known a priori) number of transversals.
As a result, the delays are now large, comparable to the total running time.
Such instances were obtained by running the algorithm with the output of a
particular randomly generated instance as its input. This also serves as a test
of the correctness of the algorithm, as already explained. The results are
summarized in Table 3. Notice that the problem size is now defined by the number
of nodes, n, and the number of transversals, while the number of edges, m, is
now the size of the output.

6 Conclusion 

In this paper we describe an implementation of a new algorithm for solving the 
TRANSVERSAL HYPERGRAPH problem. This problem has certain peculiarities 
regarding its efficiency, in that both the total running time and the rate of the 
output (delay between consecutive outputs) are of equal importance. Our ex- 
periments show a surprisingly even rate of the output, something which calls for 
further theoretical justification. Moreover, both the delays between consecutive 
outputs and the total running time suggest that quite large instances (with
respect to both their input and output sizes) can be solved efficiently. Future
work will include the removal of recursion, since this will further speed up the
code, and a more in-depth study of the performance measures of the program.
Table 1. Delay time and CPU time for various problem instances. Reports are
averages over 50 runs. Each problem instance is defined by the number of nodes
n and the number of hyperedges m. Transversals corresponds to the number
of generated transversals, while Delay Time corresponds to the number of calls
of procedure add_next_hyperedge.

              Transversals (×10^3)       Delay Time (×10^3)       CPU Time
   n      m     min     max  average     max  average    total  (seconds)

  15     10    0.03     0.1      0.1    0.01    0.004      0.3          0
         30    0.08     0.3      0.2     0.1     0.02        3          0
         50     0.1     0.4      0.3     0.2     0.03        9        0.1
        100     0.2     0.6      0.5     0.5      0.1       30        0.2
        200     0.3       1      0.6       1      0.1       90          1
        400     0.4     1.2      0.9       2      0.3      250          2
        600     0.7     1.5      1.1       4      0.5      500          4
        800       1     1.6      1.4       5      0.6      800          7
       1000     1.4       2      1.7       6      0.7     1200         10

  20     10     0.1     0.4      0.2    0.02    0.004      0.6          0
         30     0.3       1      0.7     0.1     0.02       10        0.1
         50     0.4       2      1.2     0.3     0.03       30        0.2
        100     0.5       3        2     0.7      0.1      100          1
        200       2       6        3       2      0.1      500          5
        400       1       9        6       4      0.3     1800         20
        600       2      11        8       7      0.5     3800         50
        800       2      13        9       9      0.6     5800         70
       1000       2      14        9      12      0.9     8200        110

  30     10     0.2       2      0.6    0.01    0.003        2          0
         30       2      12        5     0.2     0.01       70          1
         50       2      20       10     0.4     0.02      200          2
         70       5      40       20       1     0.03      700         10
        100       5      80       30       1     0.05     1800         20
        200      10     200       80       3      0.1     9100        150
        400      30     350      200       9      0.2    46000        900
        600      15     500      300      12      0.4    11000       7700

  40     10     0.1       3      1.3    0.01    0.003        4          0
         30       5      60       25     0.4     0.01      300          4
         50      11     300       90       1     0.02     1600         20
         70      13     400      200       2     0.03     4400         80
        100      13    1300      400       3     0.04     1600        360

  50     10     0.3       8        3    0.02    0.003        7        0.1
         30       5     400      100     0.5     0.01      900         10
         50      40    2000      470       1     0.02     7200        140
         70      20    4100     1200       4     0.02    28000       2300

  60     10       1      15        5    0.02    0.002       12        0.3
         30      20    1500      300       1     0.01     2600         50
         50      30    3700     1300       5     0.02    21000        450
Table 2. Statistical measures (Average Delay and standard deviation σ) for
individual problem instances. Entries in the 1st and 2nd columns (Transversals
and Total Time, respectively) correspond to the number of generated transversals
and the number of calls of procedure add_next_hyperedge, respectively.



           Transversals  Total Time  Average Delay        σ   CPU Time
   n    m       (×10^3)     (×10^3)        (×10^3)  (×10^3)  (seconds)



n = 15 10 

30 


0.1 

0.1 


0.4 

2 




0.002 

0.022 


50 


0.3 


10 


0.04 


0.040 


100 


0.2 


30 


0.1 


0.1 


200 


0.4 


70 


0.2 


0.2 


400 


0.5 


200 


0.4 


0.5 


600 


0.7 


450 


0.6 


0.8 


800 


1 


820 


0.7 


0.8 


1000 


2 


1200 


0.8 


1 



0.2 0.8 

0.3 6 

1 40 

2 100 

3 650 

6 250 

6 5000 

6 6000 

5 8000 



0.6 3 

4 80 

10 500 

20 100 

30 3000 

80 2000 

70 33000 

300 180000 



n = 40 10 2 6 

30 40 600 

50 70 2000 

70 300 11000 

100 700 40000 





2 8 

400 4200 

3400 63000 
Table 3. Total time, CPU time, and statistical measures for individual
non-random (dual) problem instances. The size of the output, m, is known a
priori. Each row corresponds to an individual run. Total Time is the number of
calls of procedure add_next_hyperedge for the generation of all transversals.

   n  Transversals      m  Total Time  Average Delay        σ  CPU Time
                              (×10^3)        (×10^3)  (×10^3) (seconds)

  15            80     10         0.7           0.07     0.03         0
               200     30          10            0.4      0.3       0.1
               200     50          10            0.2      0.1       0.1
               500    100          90              1      0.8         1
               800    200         200              1      0.6         2
              1000    400         450              1      0.8         4
              1100    600         500            0.8      0.6         4
              1200    800         700              1      0.6         5
              1700   1000        1500            1.5      1.1        10

  20           250     10           4            0.4      0.2       0.1
               500     30          40            1.5        1         1
              1100     50          20              3        2         4
              1400    100         400              4        4        10
              3000    200        1900             10        8        50
              8000    400       25000             60       50       500
              8000    600       29000              5       40       500
              9000    800       20000             20       20       500
             12000   1000       44000             40       40      1000

  30           600     10          20              2        2         1
              4200     30        2700             90      130       300
             20000     50      103000           2100     2700     12000

  40          1300     10          50              5        5        20
              4300     20        3000            150      190       500
             14000     30       51000           1700     1900     12000

  50          2600     10         100             10       10        70
             32000     20       27000           1400     1600     45000
Abstract. Non-Euclidean TSP construction heuristics, and especially
asymmetric TSP construction heuristics, have been neglected in the lit- 
erature by comparison with the extensive efforts devoted to studying 
Euclidean TSP construction heuristics. Motivation for remedying this 
gap in the study of construction approaches is increased by the fact 
that such methods are a great deal faster than other TSP heuristics, 
which can be important for real time problems requiring continuously 
updated response. The purpose of this paper is to describe two new con- 
struction heuristics for the asymmetric TSP and a third heuristic based 
on combining the other two. Extensive computational experiments are 
performed for several different families of TSP instances, disclosing that 
our combined heuristic clearly outperforms well-known TSP construction 
methods and proves significantly more robust in obtaining high quality 
solutions over a wide range of problems. We also provide a short overview 
of recent results in domination analysis of TSP construction heuristics. 



1 Introduction 



A construction heuristic for the traveling salesman problem (TSP) builds a tour
without an attempt to improve the tour once it is constructed. Most of the
construction heuristics for the TSP are very fast; they can be used to produce
approximate solutions for the TSP when the time is restricted, to provide good
initial solutions for tour improvement heuristics, to obtain upper bounds for
exact branch-and-bound algorithms, etc.

Extensive research has been devoted to construction heuristics for the
Euclidean TSP (see, e.g., [?]). Construction heuristics for the non-Euclidean
TSP are much less investigated. Quite often, the greedy algorithm is chosen as a
construction heuristic for the non-Euclidean TSP (see, e.g., [?]). Our
computational experiments show that this heuristic is far from being the best
choice in terms of quality and robustness. Various insertion algorithms [?],
which perform very well for the Euclidean TSP, produce poor quality solutions
for non-Euclidean instances.

Hence, it is important to study construction heuristics for the asymmetric
TSP (we understand by the asymmetric TSP the general TSP which includes
both asymmetric and symmetric instances). Our aim is to describe two new
construction heuristics for the asymmetric TSP as well as a combined algorithm
based on those heuristics. In this paper we also present results of our
computational experiments obtained for several different families of TSP
instances. These results show that the combined algorithm clearly outperforms
well-known construction heuristics for the TSP. While other heuristics produce
good quality tours for some families of TSP instances and fail for some other
families of instances, the combined algorithm appears much more robust. Being a
heuristic, the combined algorithm is not always the winner among the various
heuristics. However, we show that it rarely obtains (relatively) poor quality
solutions.

We also discuss recent results in the new area of domination analysis of TSP
construction heuristics. Domination analysis aims to provide some theoretical
foundations for construction heuristics.

The vertex set of a weighted complete digraph K is denoted by V(K); the
weight of an arc xy of K is denoted by $c_K(xy)$. (We say that K is complete if
the existence of an arc xy in K, $x \neq y \in V(K)$, implies that the arc yx
is also in K.) The asymmetric traveling salesman problem is defined as follows:
given a weighted complete digraph K on n vertices, find a Hamiltonian cycle
(tour) H of K of minimum weight. A cycle factor of K is a collection of
vertex-disjoint cycles in K covering all vertices of K. A cycle factor of K of
minimum (total) weight can be found in time $O(n^3)$ using assignment problem
(AP) algorithms (for the corresponding weighted complete bipartite graph) [?].
Clearly, the weight of the lightest cycle factor of K provides a lower bound to
the solution of the TSP (AP lower bound).
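
To make the notion of a cycle factor and of the AP lower bound concrete, here is a brute-force sketch for tiny instances. It enumerates all successor permutations instead of running an O(n^3) assignment algorithm, so it is only an illustration; the function names and the 4-vertex weight matrix are assumptions of this example.

```python
from itertools import permutations


def min_cycle_factor(w):
    """Brute-force minimum-weight cycle factor; succ[i] is the successor of i."""
    n = len(w)
    best_cost, best_succ = None, None
    for succ in permutations(range(n)):
        if any(succ[i] == i for i in range(n)):
            continue  # loops are not arcs of K
        cost = sum(w[i][succ[i]] for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_succ = cost, succ
    return best_cost, best_succ


def cycles_of(succ):
    """Split a successor array into its vertex-disjoint cycles."""
    seen, cycles = set(), []
    for start in range(len(succ)):
        if start in seen:
            continue
        cycle, v = [], start
        while v not in seen:
            seen.add(v)
            cycle.append(v)
            v = succ[v]
        cycles.append(cycle)
    return cycles


w = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
lower_bound, succ = min_cycle_factor(w)
```

On this matrix the lightest cycle factor consists of the two 2-cycles (0, 1) and (2, 3) with total weight 4, while every tour must use two arcs of weight 9, so the bound is far from the optimum.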

We will use the operation of contraction of a (directed) path $P = v_1 v_2
\ldots v_s$ of K. The result of this operation is a weighted complete digraph
K/P with vertex set $V(K/P) = V(K) \cup \{p\} - \{v_1, v_2, \ldots, v_s\}$,
where p is a new vertex. The weight of an arc xy of K/P is

$$
c_{K/P}(xy) =
\begin{cases}
c_K(xy)     & \text{if } x \neq p \text{ and } y \neq p \\
c_K(v_s y)  & \text{if } x = p \text{ and } y \neq p \\
c_K(x v_1)  & \text{if } x \neq p \text{ and } y = p
\end{cases}
\qquad (1)
$$

Sometimes, we contract an arc a, considering a as a path (of length one).
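
The contraction rule (1) translates directly into an operation on a weight matrix. The sketch below is a minimal illustration; the list-of-lists representation, the choice of placing the new vertex p last, and the name `contract_path` are assumptions of this example.

```python
def contract_path(w, path):
    """Contract the directed path `path` (a list of vertices) in matrix w.

    Returns (new_w, labels): labels[k] is the original vertex behind row k,
    or the tuple of path vertices for the new vertex p, which is placed last.
    """
    inside = set(path)
    keep = [v for v in range(len(w)) if v not in inside]
    first, last = path[0], path[-1]
    m = len(keep) + 1
    new_w = [[0] * m for _ in range(m)]
    for a, x in enumerate(keep):
        for b, y in enumerate(keep):
            new_w[a][b] = w[x][y]        # x != p and y != p
        new_w[m - 1][a] = w[last][x]     # arcs leaving p start at v_s
        new_w[a][m - 1] = w[x][first]    # arcs entering p end at v_1
    return new_w, keep + [tuple(path)]


w = [[10 * i + j for j in range(4)] for i in range(4)]
new_w, labels = contract_path(w, [1, 2])
```

Diagonal entries are carried along but never used, since K has no loops.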



2 Greedy and Random Insertion Heuristics

These two heuristics were used in order to compare our algorithms with well- 
known ones. 

The greedy algorithm finds the lightest arc a in K and contracts it (updating
the weights according to (1)). The same procedure is recursively applied to the
contracted digraph K := K/a till K consists of a pair of arcs. The contracted
arcs and the pair of remaining arcs form the "greedy" tour in K.
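
A compact way to experiment with the greedy heuristic is the arc-selection formulation sketched below: arcs are scanned in order of increasing weight and accepted whenever they extend the current set of vertex-disjoint paths. We use it here as an assumed equivalent of repeated contraction (contracting an arc removes exactly the arcs that this test rejects); the function name and data layout are hypothetical.

```python
def greedy_tour(w):
    """Greedy tour for a weight matrix w with at least two vertices."""
    n = len(w)
    arcs = sorted((w[i][j], i, j) for i in range(n) for j in range(n) if i != j)
    succ = [None] * n
    pred = [None] * n
    front = list(range(n))  # front[e]: first vertex of the path ending at e
    tail = list(range(n))   # tail[s]: last vertex of the path starting at s
    chosen = 0
    for _, i, j in arcs:
        if chosen == n - 1:
            break
        # i must still end a path, j must still start one, and the arc
        # must not close a cycle on fewer than n vertices.
        if succ[i] is None and pred[j] is None and front[i] != j:
            s, e = front[i], tail[j]
            succ[i], pred[j] = j, i
            tail[s], front[e] = e, s
            chosen += 1
    # The n-1 chosen arcs form a Hamiltonian path; the closing arc is forced.
    start = next(v for v in range(n) if pred[v] is None)
    tour = [start]
    while len(tour) < n:
        tour.append(succ[tour[-1]])
    return tour
```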

The random insertion heuristic chooses randomly two initial vertices $i_1$ and
$i_2$ in K and forms the cycle $i_1 i_2 i_1$. Then, in every iteration, it
chooses randomly a vertex $\ell$ of K which is not in the current cycle and
inserts $\ell$ in the cycle (i.e., replaces an arc $i_m i_{m+1}$ of the cycle
with the path $i_m \ell i_{m+1}$) such that the weight of the cycle increases as
little as possible. The heuristic stops when all vertices have been included in
the current cycle.
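
The random insertion heuristic is short enough to sketch in full. The version below is an illustration under our own assumptions (weight matrix as a list of lists, an optional `random.Random` instance for reproducibility); drawing a random permutation up front is equivalent to picking a random remaining vertex in every iteration.

```python
import random


def random_insertion(w, rng=None):
    """Random insertion tour for a weight matrix w with at least two vertices."""
    if rng is None:
        rng = random.Random()
    n = len(w)
    order = list(range(n))
    rng.shuffle(order)
    tour = order[:2]  # the initial cycle on two random vertices
    for v in order[2:]:
        best_pos, best_inc = None, None
        for k in range(len(tour)):
            a, b = tour[k], tour[(k + 1) % len(tour)]
            inc = w[a][v] + w[v][b] - w[a][b]  # weight increase of a -> v -> b
            if best_inc is None or inc < best_inc:
                best_pos, best_inc = k + 1, inc
        tour.insert(best_pos, v)
    return tour


w = [[0, 1, 5, 5], [5, 0, 1, 5], [5, 5, 0, 1], [1, 5, 5, 0]]
tour = random_insertion(w, random.Random(0))
```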



3 Modified Karp-Steele Patching Heuristic 



Our first heuristic (denoted by GKS) is based on the well-known Karp-Steele
patching (KSP) heuristic [?].

The algorithm can be outlined as follows:



1. Construct a cycle factor F of minimum weight. 

2. Choose a pair of arcs taken from different cycles in F, such that by patching 
(i.e. removing the chosen arcs and adding two other arcs that join both cycles 
together) we obtain a cycle factor (with one less cycle) of minimum weight 
(within the framework of patching) . 

3. Repeat step 2 until the current cycle factor is reduced to a single cycle. Use 
this cycle as an approximate solution for the TSP. 



The difference from the original Karp-Steele algorithm is that instead of joining 
two shortest cycles together it tries all possible pairs, using the best one. (The 
length of a cycle C is the number of vertices in C.) 
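
Steps 2 and 3 can be illustrated by the following unoptimized sketch, which re-evaluates all pairs of arcs in every iteration. It assumes that a cycle factor is already available as a successor array `succ`; the function name and the representation are assumptions of this example.

```python
def patch_all_pairs(w, succ):
    """Apply the cheapest patching repeatedly until one cycle remains."""
    succ = list(succ)
    n = len(succ)
    while True:
        # Label every vertex with the index of the cycle it belongs to.
        label = [None] * n
        count = 0
        for s in range(n):
            if label[s] is None:
                v = s
                while label[v] is None:
                    label[v] = count
                    v = succ[v]
                count += 1
        if count == 1:
            return succ
        best = None
        for i in range(n):
            for j in range(n):
                if label[i] != label[j]:
                    # Remove (i, succ[i]) and (j, succ[j]);
                    # add (i, succ[j]) and (j, succ[i]).
                    cost = (w[i][succ[j]] + w[j][succ[i]]
                            - w[i][succ[i]] - w[j][succ[j]])
                    if best is None or cost < best[0]:
                        best = (cost, i, j)
        _, i, j = best
        succ[i], succ[j] = succ[j], succ[i]  # merges the two cycles


w = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
tour_succ = patch_all_pairs(w, [1, 0, 3, 2])
```

Swapping the successors of two vertices from different cycles is what joins the cycles; the cost expression is the patching cost used in the outline.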

Unfortunately, a straightforward implementation of this algorithm would be
very inefficient in terms of execution time. To partly overcome this problem we
introduced a pre-calculated n × n matrix C of patching costs for all possible
pairs of arcs. On every iteration we find a smallest element of C and perform
the corresponding patching; also, the matrix is updated to reflect the patching
operation that took place. Having observed that only a relatively small part of
C needs to be re-calculated during an iteration, we cache row minima of C in a
separate vector B, incrementally updating it whenever possible. If it is
impossible to update an element of B incrementally (this happens when the
smallest item in a row of C has been changed to a greater value), we
re-calculate this element of B by scanning the corresponding row of C at the
beginning of the next iteration. Finally, instead of scanning all elements of C
in order to find its minimum, we just scan the n elements of B to achieve the
same goal.

Although the improved version has the same $O(n^3)$ worst-case complexity as
the original algorithm, our experiments show that the aforementioned
improvements yield a significant reduction of execution time. A detailed
description of the approach outlined above follows.
Pseudo-code

(* Cost is the matrix C; BestCost and BestNode together implement the vector B *)

Arguments:

    N - number of nodes
    W = (w_ij) - N x N matrix of distances

(* Next[i] = j if the current cycle factor contains the arc (i, j) *)
Next: array[1..N] of integer;

(* Cost[i, j] is the cost of removing arcs (i, Next[i]) and (j, Next[j]), and
   adding (i, Next[j]) and (j, Next[i]) instead. *)
Cost: array[1..N, 1..N] of integer;

(* BestCost[i] contains a smallest value found in the i-th row of Cost *)
BestCost: array[1..N] of integer;

(* BestNode[i] is the column index of the corresponding BestCost[i] in Cost *)
BestNode: array[1..N] of integer;

(* Whenever possible, caches row minimum in BestNode and BestCost *)
procedure UpdateCost(r, c, newCost: integer);
begin
    if (r < c) then Swap(r, c);  (* Exchange r and c values *)
    Cost[r, c] := newCost;
    if BestNode[r] ≠ -1 and newCost < BestCost[r] then
        (* New value is smaller than current row minimum *)
        BestNode[r] := c;
        BestCost[r] := newCost;
    else if BestNode[r] = c and newCost > BestCost[r] then
        (* Current row minimum has been updated to a greater value *)
        BestNode[r] := -1;  (* stands for "unknown" *)
        BestCost[r] := +∞;
    end if;
end;

function GetPatchingCost(i, j: integer): integer;
begin
    if vertices i and j belong to the same cycle then
        return +∞;  (* patching not allowed *)
    else
        return W[i, Next[j]] + W[j, Next[i]] - W[i, Next[i]] - W[j, Next[j]];
    end if;
end;

BEGIN
    Build a cycle factor by solving LAP on the distances matrix, store result in
    Next and number of cycles obtained in M;

    (* Initialize Cost, BestCost and BestNode *)
    for i := 1 to N do
        bn := 1;
        for j := 1 to i do
            Cost[i, j] := GetPatchingCost(i, j);
            if Cost[i, j] < Cost[i, bn] then bn := j;
        end for;
        BestNode[i] := bn;
        BestCost[i] := Cost[i, bn];
    end for;

    repeat M - 1 times
        Find the smallest value in BestCost and store its index in i;
        j := BestNode[i];

        update Cost:
        1) for each pair of nodes k and l, such that k belongs to
           the same cycle as i, and l belongs to the same cycle as j:
               UpdateCost(k, l, +∞);
        2) for each node m, which does not belong to the same cycle as
           either i or j:
               UpdateCost(i, m, GetPatchingCost(i, m));
               UpdateCost(m, j, GetPatchingCost(m, j));

        Patch two cycles by removing arcs (i, Next[i]), (j, Next[j]), and
        adding (i, Next[j]) and (j, Next[i]); update Next to reflect the
        patching operation;

        (* re-calculate BestNode and BestCost if necessary *)
        for i := 1 to N do
            if BestNode[i] = -1 then
                (* Needs re-calculating *)
                bn := 1;
                for j := 1 to i do
                    if Cost[i, j] < Cost[i, bn] then bn := j;
                end for;
                BestNode[i] := bn;
                BestCost[i] := Cost[i, bn];
            end if;
        end for;
    end repeat;
END.



4 Recursive Path Contraction Algorithm 

The second heuristic originates from [?]. One of the main features of this
heuristic is the fact that its solution has a large domination number. We
discuss some recent results on domination analysis of TSP algorithms in
Section 7.
The algorithm (denoted by RPC) proceeds as follows: 

1. Find a minimum weight cycle factor F . 

2. Delete a heaviest arc of each cycle of F and contract the obtained paths one 
by one. 

3. If the number of cycles is greater than one, apply this procedure recursively. 

4. Finally, we obtain a single cycle C. Replace all vertices of C with the corre- 
sponding contracted paths and return the tour obtained as a result of this 
procedure. 

5 Contract-or-Patch Heuristic 

The third heuristic (denoted by COP - contract or patch) is a combination of 
the previous two algorithms. It proceeds as follows: 

1. Fix a threshold t. 

2. Find a minimum weight cycle factor F. 

3. If there is a cycle in F of length (= number of vertices) at most t, delete
a heaviest arc in every short cycle (i.e. of length at most t), contract
the obtained paths (the vertices of the long cycles are not involved in the
contraction), and repeat the above procedure. Otherwise, patch all cycles
(they are all long) using GKS.

Our computational experiments (see the next section) showed that t = 5
is a quite robust choice of the threshold t. Therefore, this value of t has been
used while comparing COP with other heuristics.
6 Computational Results



We have implemented all three heuristics along with KSP, the greedy algorithm
(GR), and the random insertion algorithm (RI), and tested them on the following
seven families of instances of the TSP:



1. all asymmetric TSP instances from TSPLib; 

2. all Euclidean TSP instances from TSPLib with the number of nodes not 
exceeding 3038; 

3. asymmetric TSP instances with cost matrix C = (cij), with Cij indepen- 
dently and uniformly chosen random numbers from {0, 1, 2, 10®}; 

4. asymmetric TSP instances with cost matrix $C = (c_{ij})$, with $c_{ij}$
independently and uniformly chosen random numbers from $\{0, 1, 2, \ldots, i
\times j\}$;

5. symmetric TSP instances with cost matrix C = (cij), with Cij indepen- 
dently and uniformly chosen random numbers from {0, 1, 2, ..., 10®} (i < j); 

6. symmetric TSP instances with cost matrix $C = (c_{ij})$, with $c_{ij}$
independently and uniformly chosen random numbers from $\{0, 1, 2, \ldots, i
\times j\}$ (i < j);

7. sloped plane instances ([?]), which are defined as follows: for a given pair
of nodes $p_i = (x_i, y_i)$ and $p_j = (x_j, y_j)$ the distance is

$$
c(i, j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} - \max(0, y_i - y_j)
          + 2 \times \max(0, y_j - y_i). \qquad (2)
$$


We have tested the algorithms on sloped plane instances with independently 
and uniformly chosen random coordinates from {0, 1, 2, ..., 10®}. 
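
For readers who want to generate such instances, the sloped plane distance (2) can be written as a small function. The sketch assumes the usual form of this distance, in which the Euclidean term is taken under a square root; the function name and the tuple representation of points are assumptions of this example.

```python
import math


def sloped_plane_distance(p, q):
    """Distance from p = (x_i, y_i) to q = (x_j, y_j) on the sloped plane."""
    (xi, yi), (xj, yj) = p, q
    euclid = math.hypot(xi - xj, yi - yj)
    # Going down (y decreases) is rewarded, going up is penalized twice as much.
    return euclid - max(0, yi - yj) + 2 * max(0, yj - yi)


up = sloped_plane_distance((0, 0), (3, 4))
down = sloped_plane_distance((3, 4), (0, 0))
```

The two directions between the same pair of points differ, which is what makes these instances asymmetric.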

For the families 3-7, the number of cities n was varied from 100 to 3000, in
increments of 100 cities. For 100 ≤ n ≤ 1000, all results are averages over 10
trials each, and for 1000 < n ≤ 3000, the results are averages over 3 trials
each.

Our implementations of the RPC and COP algorithms make use of a shortest
augmenting path algorithm, described in [?], for solving the assignment
problem. This algorithm is of high performance in practice and provides a
partial explanation for the relatively small execution times for RPC and COP
seen in Table 2.

All tests were executed on a Pentium II 333 MHz machine with 128MB of
RAM. All results for TSPLib instances are compared to optima; results for other
instances are compared to a lower bound obtained by solving the corresponding
assignment problem.

Tables 1 and 2 show an overview of the results we have obtained. 

Observe that COP is the only heuristic of the above six that performs
well on all the tested families. The others fail on at least one of the families.
Note also that for symmetric TSP instances, as well as for the instances close
to symmetric (the random sloped plane instances), the lower bound that we used
is far from being sharp. Apart from being effective, COP is also comparable in
execution time with the greedy algorithm, which is often used for the
asymmetric TSP.
Table 1. Average excess over optimum or AP lower bound

Family     Instances        GR        RI       KSP       GKS       RPC       COP

Family 1          26    30.62%    17.36%     4.29%     3.36%    18.02%     4.77%
Family 2          69    18.29%    11.61%    15.00%    17.26%    36.72%    17.52%
Family 3         160   320.13%  1467.38%     3.11%     3.09%   106.65%     1.88%
Family 4         160   515.10%  1369.13%     2.06%     2.02%   146.73%     1.11%
Family 5         160   246.65%  1489.95%   744.22%   586.92%   183.57%    79.87%
Family 6         160   405.77%  1386.32%   562.79%   195.06%   229.80%    83.77%
Family 7         160  2201.19%    41.78%    44.20%    46.33%    72.17%    47.29%


Table 2. Average execution time (sec)

Family     Instances        GR        RI       KSP       GKS       RPC       COP

Family 1          26    0.0917    0.0100    0.0547    0.0597    0.0508    0.1949
Family 2          69    3.2688    0.3611    1.2474    2.3377    1.1672    3.9341
Family 3         160    6.2109    0.3016    1.4624    1.5602    1.3112    2.6764
Family 4         160    7.1253    0.3027    2.7324    2.8338    2.5634    4.9129
Family 5         160    2.9795    0.2565    1.7093    2.4085    1.7268    2.4354
Family 6         160    3.3299    0.2568    2.8662    3.8108    3.0338    3.8821
Family 7         160    7.0776    0.3016    6.9904    7.8782    7.9787   12.0041

7 Algorithms of Factorial Domination Number

An equivalent of the following notion of domination number of an algorithm was
introduced by Punnen [?] and Glover and Punnen [?]. The domination number,
domn(A, n), of an approximation algorithm A for the TSP is the maximum
integer d = d(n) such that, for every instance $\mathcal{I}$ of the TSP on n
cities, A produces a tour T which is not worse than at least d tours in
$\mathcal{I}$ including T itself. Clearly, every exact TSP algorithm is of
domination number $(n - 1)!$. Thus, the domination number of an algorithm close
to $(n - 1)!$ may indicate that the algorithm is of high quality.
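
On a single tiny instance, the quantity behind this definition can be computed by brute force: count the tours that are not better than a given tour T. The sketch below does this for one instance only (the domination number of an algorithm is the minimum of this count over all instances on n cities); the function name and the example matrix are assumptions of this illustration.

```python
from itertools import permutations


def tours_not_better(w, tour):
    """Number of tours (T included) whose weight is at least that of `tour`."""
    n = len(w)

    def weight(t):
        return sum(w[t[k]][t[(k + 1) % n]] for k in range(n))

    target = weight(tour)
    count = 0
    # Fixing vertex 0 as the first city enumerates each of the (n-1)! tours once.
    for rest in permutations(range(1, n)):
        if weight((0,) + rest) >= target:
            count += 1
    return count


w = [[0, 1, 5, 5], [5, 0, 1, 5], [5, 5, 0, 1], [1, 5, 5, 0]]
best = tours_not_better(w, [0, 1, 2, 3])    # the optimal tour
other = tours_not_better(w, [0, 2, 1, 3])   # a tour of weight 16
```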

As we pointed out in Section 4, the heuristic RPC has a large domination
number. Let $c_i$ be the number of cycles in the i-th cycle factor (the first
one is F) and let m be the number of cycle factors derived in RPC. Then one can
show that the domination number of the tour constructed by RPC is at least
$(n - c_1 - 1)!\,(c_1 - c_2 - 1)! \cdots (c_{m-1} - c_m - 1)!$ [?]. This number
is quite large when the number of cycles in cycle factors is small (which is
often the case for 'pure' asymmetric instances of the TSP). Still, when most of
the cycles are of length two, the above product is not so 'big'. It was proved
in [?] that $domn(RPC, n) = \Omega(n!/r^n)$ for every r > 3.15.
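
The product bound is easy to evaluate for given cycle counts. The helper below is a small illustration (the function name and the example cycle counts are assumptions); it shows how sharply the bound drops when the cycle factors consist mostly of 2-cycles.

```python
from math import factorial


def rpc_domination_bound(n, cycle_counts):
    """(n - c_1 - 1)! (c_1 - c_2 - 1)! ... (c_{m-1} - c_m - 1)!"""
    sizes = [n] + list(cycle_counts)
    bound = 1
    for prev, cur in zip(sizes, sizes[1:]):
        bound *= factorial(prev - cur - 1)
    return bound


few_cycles = rpc_domination_bound(12, [3, 1])      # 8! * 1!
only_short = rpc_domination_bound(12, [6, 3, 1])   # 5! * 2! * 1!
```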

This result was improved in [?] to $domn(B, n) = \Omega(n!/r^n)$ for every
r > 1.5, where B is a polynomial approximation TSP algorithm introduced in [?].
In [?], Gutin and Yeo introduced a polynomial time heuristic GEA and proved that
$domn(GEA, n) \geq (n - 2)!$, resolving a conjecture of Glover and Punnen [?].
Using one of the main results in [?], Punnen and Kabadi [?] showed that some
well-known heuristics, including RI and KSP, are of domination number at least
$(n - 2)!$.

Gutin and Yeo introduced another polynomial time TSP algorithm. The
algorithm is demonstrated to be of domination number $(n - 1)!/2$. Note that the
proof is based on a reported yet unpublished theorem by R. Häggkvist [?] on
Hamiltonian decompositions of regular digraphs. Also, the algorithm is somewhat
slow and thus impractical. Still, we believe that this result in [?] indicates
a possibility for improvement on currently known construction heuristics for the
TSP.

Heuristics yielding tours with exponential yet much smaller domination numbers
were introduced in the literature on so-called exponential neighbourhoods
for the TSP (for a comprehensive survey of the topic, see [?]). Exponential
neighbourhood local search has already shown its high computational potential
for the TSP (see, e.g., [?]).

8 Conclusions 

The results of our computational experiments show clearly that our combined
algorithm COP can be used for a wide variety of TSP instances as a fast
heuristic of good quality. They also demonstrate that theoretical investigation
of algorithms that produce solutions of exponential domination number can be
used in practice to design effective and efficient construction heuristics for
the TSP.
Abstract. In this work we study the problem of counting the number
of mobile hosts in mobile networks. Mobile networks aim to provide
continuous network connectivity to users regardless of their location. Host
mobility introduces a number of new features and requirements for the
distributed algorithms. In this case, the use of conventional distributed
algorithms from mobile hosts results in a number of serious drawbacks.
The two tier principle has been proposed (see [?]) to overcome these
problems. The use of this principle for structuring distributed algorithms for
mobile hosts means that the computation and communication requirements
of an algorithm are borne by the static hosts to the maximum
extent possible.

The Distributed Systems Platform (DSP) is a software platform that 
has been designed for the implementation, simulation and testing of dis- 
tributed protocols. It offers a set of subtools which permit the researcher 
and the protocol designer to work under a familiar graphical and al- 
gorithmic environment. The use of DSP gave us considerable input and 
permitted us to experimentally test the two tier principle for the counting 
problem of mobile hosts. Moreover it helped us to design new distributed 
algorithms for this problem, improve them and experimentally test them, 
validating their performance under various conditions. 



1 Introduction 

Generally there is a considerable gap between the theoretical results of
Distributed Computing and the implemented protocols, especially in the case of
networks of thousands of nodes. On the other hand, well-designed tools could
offer researchers a more practical view of the existing problems in this area,
and this, in turn, could lead to better protocol design (in terms of flexibility
and efficiency). Our work shows that a suitably designed platform can become a
flexible tool for the researcher and offer valuable help both in the
verification and in the extension of theoretical results (see also [?]).

* This work was partially supported by the EU ESPRIT LTR ALCOM-IT. (contract 
No. 20244). 
The Distributed System Platform (DSP) is a software tool designed and
developed during the sequence of ALCOM¹ projects and took its current form as
a platform during the ALCOM-IT project. It provides an integrated environment
for the implementation, simulation and testing of distributed systems and
protocols. The DSP offers an integrated graphical environment for the design
and implementation of simulation experiments of various scales. It can provide
visualization (animation) for networks with a restricted number of nodes, or
support experiments with networks of hundreds or thousands of nodes. It provides
a set of simple, algorithmic languages which can describe the topology and the
behaviour of distributed systems, and it can support the testing process (on-line
simulation management, selective tracing and presentation of results) during the
execution of specific and complex simulation scenarios. The DSP can support
the hierarchical simulation of more than one type of protocol in the same
execution. The latter is suggested in the case of pipelined protocols (the
protocols of the upper level use the final output of the protocols of the lower
level, e.g. leader election and counting protocols) or layered protocols (the
protocols of the upper level call and use in every step the protocols of the
lower level, e.g. synchronizers). Moreover, in its latest version DSP supports
the simulation of mobile protocols. The reader can find more about DSP in
Section 3 and in [?], [?] and [?].

In this work we use this platform for the design, testing and verification of
distributed protocols related to mobile computing. The problem of process or
node counting (size of network) has been extensively studied in the case of
networks with static hosts. Various solutions have been proposed in the past
(e.g., see [?]) depending on the model and the assumptions of the fixed network
(e.g., timing conditions, network topology, existence of a network leader,
dynamic or static network, distinguishable processes or not). The problem of
process counting is one of the most fundamental problems in network control.
Also note that the mobility of the hosts introduces new technical difficulties
which, at first sight, seem to invalidate the known solutions to the problem.

In the case of mobile networks the problem seems to retain its significance.
Indeed, knowing how many mobile users are currently connected to a
mobile network is generally valuable and can be used at both the control level
(e.g. routing, data management) and the application level (e.g. sales and
inventory applications, see [?]) of the network. Furthermore, this fundamental
problem gives good insight into the methodology of applying distributed
protocols to mobile networks and into their performance. We believe that
solutions to this problem in the mobile network setting will provide basic
building blocks for solutions to more complicated problems (e.g. election of a
leader).

To facilitate continuous network coverage for mobile hosts, a static network
is augmented with mobile support stations (MSSs) that are capable of directly
communicating with Mobile Hosts (MHs) within a limited geographical area
("cell"). In effect, MSSs serve as access points for an MH to connect to the static
network, and the cell from which an MH connects to the static network represents
its current "location". MHs are thereby able to connect to the static segment of
the network from different locations at different times. Consequently, the overall
network topology changes dynamically as MHs move from one cell to another.
This implies, first, that distributed algorithms for a mobile computing environment
cannot assume that a host maintains a fixed and universally known location in
the network at all times; a mobile host must first be located ("searched") before
a message can be delivered to it. Furthermore, as hosts change their locations,
the physical connectivity of the network changes. Hence, any logical structure,
which many distributed algorithms exploit, cannot be statically mapped to a set
of physical connections within the network. Second, the bandwidth of the wireless
link connecting an MH to an MSS is significantly lower than that of ("wired")
links between static hosts. Third, mobile hosts have tight constraints on
power consumption relative to desktop machines, since they usually
operate on stand-alone sources such as battery cells. Consequently, they often
operate in a "doze mode" or voluntarily disconnect from the network. Lastly,
transmission and reception of messages over the wireless link also consume
power at an MH, so distributed algorithms need to minimize communication
over the wireless links. These aspects are characteristic of mobile computing and
need to be considered in the design of distributed algorithms.

¹ The ALCOM Projects are basic research projects funded by the European Union.

The main result of this work is a new, correct and efficient (in the number of
messages) distributed protocol for the problem of counting mobile hosts. A
significant part of the work is a demonstration of some principles of Distributed
Algorithmic Engineering (see also [?]). Specifically, starting from the two tier
principle, we show how to successfully modify classical solutions, aided by the use
of the DSP and by experiments conducted on this platform.

The remainder of this work is organized as follows. In Section 2 we give the 
system model for mobile networks with fixed base stations. A brief description 
of the DSP tool is presented in Section 3. Section 4 contains the presentation of 
a counting algorithm executed distributedly by the mobile hosts. In Section 5 a
counting protocol based on the two tier principle is presented. Finally, Section 6
contains simulation results and enhancements for the two tier protocol.



2 The System Model 

A host that can move while retaining its network connections is a mobile host
[?]. The geographical area that is served by the fixed base station network is
divided into smaller regions called cells. Each cell has a base station, also referred
to as the Mobile Service Station (MSS) of the cell. All mobile hosts that have
identified themselves with a particular MSS are considered to be local to the
cell of this MSS. At any instant, a mobile host belongs to exactly one cell.
When a mobile host enters a new cell it sends a < join(mh-id) > message to
the new MSS and is thereby added to the list of local mobile hosts of this MSS.

Each MSS is connected to the service stations of neighbouring cells by a
fixed high-bandwidth network. The communication between a mobile host and
its MSS is based on the use of low-bandwidth wireless channels. If a mobile host
h_1 wants to send a message to another mobile host h_2, it first sends the message
to its local MSS over a wireless channel. The MSS then forwards the message
through the fixed network to the local MSS of h_2, which forwards it to h_2 over
its local wireless channels. Since the location of a mobile host within the network
is neither fixed nor universally known in the whole network (its "current" cell
may change with every move), the local MSS of h_1 first needs to determine the
MSS that currently serves h_2. This means that each message transmission
between two mobile hosts also incurs a search cost. The following notation has
been proposed in [?] for the description of the cost of messages exchanged in the
network:

— C_fixed: The cost of sending a point-to-point message between any two fixed
hosts.

— C_wireless: The cost of sending a message from a mobile host to its local MSS
over a wireless channel (and vice versa). An extra cost may be incurred in order
to allocate the wireless channels ([?]).

— C_search: The cost (messages exchanged among the MSSs of the fixed network)
to locate a mobile host and forward a message to its current local MSS.
We consider that C_search = a·C_fixed, where a is a constant depending on the
location management strategy used (e.g. [?]).

Let G(V, E), where |V| = n and |E| = O(n^2), be the graph which describes
the fixed part of the network (the network of the MSSs). Each vertex models an
MSS. There exists an edge between two vertices if and only if the corresponding
MSSs communicate directly (point-to-point) in the fixed network. Let m be the
number of mobile hosts (we suppose that usually m >> n). We denote by D the
diameter of G.

Based on the above notation, the search cost C_search is approximately O(D).
A message sent from a mobile host to another mobile host incurs a cost of
2·C_wireless + C_search. This means that any algorithm based on communication
between mobile hosts requires a large number of messages to be exchanged over
the fixed network and the wireless channels.
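
As a small worked illustration of the cost model above, the following Python sketch evaluates the cost of a single message between two mobile hosts, 2·C_wireless + C_search with C_search = a·C_fixed. The function name and the numeric values are illustrative assumptions and do not come from the experiments reported here.

```python
def mh_to_mh_cost(c_wireless: float, c_fixed: float, a: float) -> float:
    """Cost of one message between two mobile hosts.

    Two wireless hops (sender -> its MSS, destination MSS -> receiver)
    plus one search in the fixed network, with C_search = a * C_fixed.
    """
    c_search = a * c_fixed
    return 2 * c_wireless + c_search


# Hypothetical unit costs, chosen only for illustration.
print(mh_to_mh_cost(c_wireless=5.0, c_fixed=1.0, a=4.0))  # 2*5 + 4*1 = 14.0
```

Because C_search grows with the diameter D, the search term tends to dominate on large fixed networks, which is the motivation for avoiding host-to-host messages.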

3 A Brief Description of the DSP Tool 

DSP is a software tool that provides an integrated environment for the simulation
and testing of distributed protocols. It follows the principles proposed in the
books of G. Tel [14] and N. Lynch [13] and aims at describing precisely and
concisely the relevant aspects of a whole class of distributed computing systems.
A distributed computation is considered to be a collection of discrete events,
each event being an atomic change in the state of the whole system. This notion
is captured by the definition of transition systems. What makes such a system
distributed is that each transition is only influenced by, and only influences,
part of the state, basically the local state of a single process. In DSP, each
process is represented by a finite state machine with a local transition table.
Events affecting transitions include the arrival of a message at a node, timeouts,
node (and link) failures and mobile process movement. The power of the
processes is not limited in any other way, since they are allowed to have local
memory and (unbounded) local registers.

In addition, the platform allows the modelling of mobile processes, the calling
of a DSP library protocol from another user protocol and user control of local
(virtual) clocks. The DSP platform thus differs from all existing "simulators"
or "languages" for distributed systems in its generality, algorithmic simplicity
and the clarity of the semantics of its supported features. It aims at providing
the distributed algorithms designer with an ideal environment of what a general
distributed system "is expected to be".

The basic components of the platform include:

I. A set of algorithmic languages that allow the description of a distributed
protocol and the specification of a distributed system.
II. A discrete event simulator that simulates the execution of a specified
distributed protocol on a specified distributed system.
III. A database of distributed protocols that can also be used as a distributed
protocol library.
IV. A graphical user interface for protocol and topology specification and for
interaction during the simulation.

The DSP can support simulation and testing of protocols for fixed base station
mobile networks in the following way: The user can describe MSSs and
Mobile Hosts as static and mobile DSP processes respectively. Given a network
topology, the nodes can be specified to execute the MSS static process protocol
in order to form the fixed base station network. Static and mobile processes may
communicate by the use of radio messages. When a mobile process transmits a
radio message, this message is received by the static process on the node where
the mobile process currently resides. On the other hand, a static process like an
MSS can transmit radio messages which are received by all mobile processes
in its node. In this way the DSP simulates the structure of cells. Any number
of mobile processes (and thus Mobile Hosts) may be created in the network.
The mobile processes may move from one node to another without restrictions,
representing the move of a mobile host from one cell to another.

4 The Virtual Topology Algorithm: A Counting Algorithm for Fixed
Networks Executed Distributedly by the Mobile Hosts

Suppose that m Mobile Hosts are moving throughout a fixed base station mobile 
network. One of the mobile hosts (the initiator of the algorithm) wants to find 
the size of the mobile network (the number of the Mobile Hosts). We assume 
that the communication between the MSSs is based on the asynchronous timing
model. We also assume that the Mobile Hosts control the execution themselves,
avoiding the participation of the MSSs. This may be the case when the protocol
requires computational power that would increase the overhead of the MSSs
(whose basic activity is to control the communications in the network).

int routing_table[];    the routing table of the MSS
set local_MH;           the set of local MHs of the MSS

do forever
    on receive < join, mh-id >
        insert mh-id to local_MH;
        execute the routing protocol with the other MSSs to update their routing tables;
    on receive < message, receiver-mh >
        if (receiver-mh is in local_MH) then
            transmit < message > to the cell;
        else
            look up the routing_table to find the next MSS i for this receiver-mh;
            send < message, receiver-mh > to i;
end;

Fig. 1. The protocol executed by the MSSs

A fundamental algorithm for the solution of the counting problem in fixed
networks can be found in [?] (p. 190). It is basically a wave algorithm which can
be used if a unique starter process (an initiator) is available. The application of
this algorithm in a mobile network implies the definition of a "virtual" topology
on the mobile hosts (each mobile host should somehow be assigned a set of
"neighbours" at the beginning of the execution). Assuming that the virtual
topology has O(m) "edges", the total cost of the protocol would be approximately
O(m)·C_wireless + O(Dm)·C_fixed. The D factor is due to the
search cost in the fixed network.

Besides the high cost of message transmissions in the fixed network, another
drawback of this approach is that it requires the participation of every mobile
host in order to maintain the connectivity of the virtual topology and therefore
cannot permit any of them to disconnect during the execution (which is
very common in the case of mobile hosts). Thus, the total protocol execution time
is a crucial parameter, since longer executions increase the consumption of
battery power in the mobile hosts.

The Mobile Service Stations act mainly as routers, forwarding messages
from one Mobile Host to another and updating their routing tables when a
mobile host joins their cell. The protocol executed by the MSSs is presented in
Figure 1.
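
The routing role of the MSSs can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the DSP specification: the class and method names are hypothetical, the routing table is a dictionary shared by all MSSs (similar in spirit to a global table), it maps each host directly to the MSS currently serving it, and the `on_leave` handler is an added assumption so that a stale entry does not remain in the old cell.

```python
class MSS:
    """Toy model of an MSS acting as a router (all names are illustrative)."""

    def __init__(self, name, routing_table):
        self.name = name
        self.local_mh = set()               # hosts currently in this cell
        self.routing_table = routing_table  # shared dict: mh_id -> MSS name

    def on_join(self, mh_id):
        # A host entering the cell is registered and its route is updated.
        self.local_mh.add(mh_id)
        self.routing_table[mh_id] = self.name

    def on_leave(self, mh_id):
        # Assumption: the old MSS forgets a host that has moved away.
        self.local_mh.discard(mh_id)

    def on_message(self, payload, receiver_mh):
        # Deliver locally if possible, otherwise forward towards the receiver.
        if receiver_mh in self.local_mh:
            return ("transmit_to_cell", payload)
        return ("forward", self.routing_table[receiver_mh], payload)


table = {}
s1, s2 = MSS("S1", table), MSS("S2", table)
s1.on_join("h1")
s2.on_join("h2")
print(s1.on_message("hello", "h2"))  # ('forward', 'S2', 'hello')
print(s2.on_message("hello", "h2"))  # ('transmit_to_cell', 'hello')
```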



4.1 Test Case Scenarios and Measurement Parameters 

An important factor for the performance of the protocols executed in a mobile 
network is the speed and the type of movement of the mobile hosts. Obviously 
when a mobile host is moving fast it will change many cells during the execution 
of the protocol and thus the overhead of keeping the routing information updated 
will grow. 

Since we could not describe the speed of the mobile hosts in physical terms
(e.g. in kilometers or miles per hour), our approach was to relate the speed
of the hosts to the propagation delay of messages in the fixed network. We
consider a slow mobile host to be a host which does not change its cell for O(D)
time units, where D is the diameter of the fixed network. In practice this means
that the host keeps its position during the time interval required for a message
to propagate from its host MSS to any other MSS. A fast mobile host is assumed
to be a host which moves from one cell to another in time much less than O(D).

In order to conduct the experiments we prepared five different topologies with
node cardinalities ranging from 20 to 100 nodes (MSSs). The transmission
delay on all links was set to one time unit in order to avoid the overhead of
message delay in the protocol execution time. In all cases the protocol was
initialized with ten mobile hosts on each node. The experiment parameters are
presented in Figure 2.

Fig. 2. The parameters used in the experiments

For all the experiments the measurements concern the following parameters: 

— The number of messages exchanged in the fixed network in order to deliver
messages between the mobile hosts. We did not take into account messages
used by the routing protocol since they are irrelevant to the execution of
the basic counting protocol itself.

— The number of radio messages transmitted by the mobile hosts. In this case 
we did not count the < join > messages since they are also used for network 
control and routing purposes. 

— The execution time of the protocol. 

The consumption of battery power of a Mobile Host during the execution of 
the protocol is expressed by the last two parameters. It depends on the number 
of transmissions made by the host and the time that the host remains active 
(protocol execution time). 

4.2 The DSP Simulation Settings of the Protocol 

We omit the details of this implementation due to lack of space. An
example of the methodology used for protocol specification in DSP is presented
in the Appendix. In order to simplify the protocol, we did not implement a
separate distributed routing protocol, since the behaviour of the routing part and
the messages exchanged for the update of routing tables were not included in our
measurements. Instead, we used a global routing table shared among all MSSs.
This table was updated by a separate "daemon" routing process.

int size=0;               the final size of the network
boolean counted=false;    a flag indicating if the host has been counted

the initiator mobile host:
begin
    broadcast < count >;
    on receive < size, s >
        size=s;
end

the other mobile hosts:
begin
    on receive < count >
        if (not counted)
        begin
            broadcast < count_me >;
            counted=true;
        end
    on receive < size, s >
        size=s;
end

Fig. 3. The counting protocol executed by the mobile hosts

At the beginning of the simulation the mobile hosts were left to move randomly
in the network in order to take random positions before the protocol
was started. After the beginning of the execution, the virtual topology was
constructed in the following way. Each Mobile Host with identity i considered as
"neighbours" the Mobile Hosts with identities i-1 and i+1, thus constructing a
"virtual" line. This is an efficient² virtual topology for this case since it contains
m-1 edges.
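
The virtual line can be made concrete with a short Python sketch. It assumes, purely for illustration, that host identities are the integers 1..m; the function name is hypothetical.

```python
def virtual_line(m):
    """Neighbours of each host 1..m on a virtual line: identities i-1 and i+1."""
    return {i: [j for j in (i - 1, i + 1) if 1 <= j <= m]
            for i in range(1, m + 1)}


nbrs = virtual_line(5)
edges = sum(len(v) for v in nbrs.values()) // 2  # each edge is listed twice
print(nbrs[1], nbrs[3], edges)  # [2] [2, 4] 4
```

The two end hosts have a single neighbour and every other host has two, which gives the m-1 edges mentioned above.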

5 A New Counting Protocol Based on the Two Tier 
Principle 

If the protocol is not so complex as to require much of the MSSs'
computational power, more efficient solutions can be provided. A guiding principle
for this case was presented in [?] and is called the two tier principle:

Computation and communication costs of an algorithm should be based on the
static portion of the network. This avoids locating a mobile participant
and lowers the total search cost of the algorithm. Additionally, the number of
operations performed at the mobile hosts is decreased and thereby consumption of
battery power (which is a critical resource for mobile hosts) is kept to a minimum.

The application of this simple principle to the design of distributed
algorithms for mobile hosts has been studied in [?] in the case of a classical
algorithm for mutual exclusion in distributed systems (Le Lann's token ring, [12]).
In this section we propose and study the behaviour of a protocol based on
this principle that solves the counting problem in a mobile network with fixed
base stations.

² The "star" topology would seem to be more appropriate, but in this case one Mobile
Host (the one in the middle of the "star") would already know the size of the network,
since all the other hosts would be its neighbours.

The proposed counting scheme is based on the execution of the Echo protocol.
The protocol executed by the mobile hosts is presented in Figure 3. As can be
seen, the execution is started by the initiator Mobile Host, which broadcasts a
< count > message. Afterwards, the initiator itself does not respond to any
< count > message. The execution on the fixed part of the network (the MSSs)
is described as follows:

1. In the first phase, the initiator MSS (the MSS serving the initiator)
broadcasts a < count > message in its cell and then spreads the request for
counting along the base station network by using the Echo algorithm. The
algorithm starts by sending < count_tok > messages to all neighbouring MSSs.

2. Upon receiving such a message (< count_tok >), an MSS broadcasts a
< count > message to its cell and waits to collect answers from the mobile hosts
in this cell (we assume that an operational mobile host responds to the
< count > message with a < count_me > message immediately). The MSS
also forwards a < count_tok > message to its neighbours in order to continue
the execution of the Echo. When the MSS receives a < count_me > message
it increases size_p, the number of counted mobile hosts in its cell. The MSSs
that become "leaves" in the execution of the Echo respond with a
< size, size_p > message to their "parents". When a "parent" receives a
< size, s > message from a "child", it adds s to size_p. Upon collecting answers
from all its "children" it reports a < size, size_p > message to its own "parent".

3. After the completion of the Echo the initiator MSS knows the total number
of mobile hosts in the network (its size_p variable).

4. The initiator base station broadcasts a < size, size_p > message in its cell
and then forwards an < inform_tok, size_p > message to its "children". Upon
receiving such a message an MSS broadcasts a < size, size_p > message to
its cell and forwards an < inform_tok, size_p > message to its "children" in
order to continue the execution. After the propagation of the inform_tok
messages in the base station network, all the MSSs have broadcast the size
of the mobile network in their cells and the mobile hosts have been informed
about it.

It is easy to see that a mobile host cannot be counted twice, even if
it moves throughout the network during the execution of the protocol (it
responds to only one < count > message). The above protocol requires m
broadcasts by the mobile hosts. The protocol needs 3|E| = O(n^2) messages to be
exchanged in the fixed network, yielding a total cost of m·C_wireless + 3|E|·C_fixed,
which is approximately O(m)·C_wireless + O(n^2)·C_fixed. The protocol is completed
in time 3D. The implementation of the protocol is presented in detail in the
Appendix.
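
The counting phase on the fixed network can be illustrated with a small Python simulation of the Echo. This is a sketch under stated assumptions rather than the DSP implementation: hosts are assumed not to move during the run (the slow case), every host answers the < count > broadcast of its cell, messages are delivered in FIFO order from a single queue, and the function and variable names are hypothetical. Tokens that are not reports to a parent carry the value 0.

```python
from collections import deque


def echo_count(adj, hosts, root):
    """Simulate the counting Echo on the MSS graph.

    adj:   dict MSS -> list of neighbouring MSSs (undirected, connected)
    hosts: dict MSS -> number of hosts answering <count> in that cell
    Returns (total known at the root, number of fixed-network messages).
    """
    parent = {root: None}
    received = {v: 0 for v in adj}   # tokens received so far by each MSS
    size = dict(hosts)               # size_p of every MSS
    queue = deque((root, v, 0) for v in adj[root])  # (sender, receiver, s)
    messages = 0
    while queue:
        u, v, s = queue.popleft()
        messages += 1
        received[v] += 1
        size[v] += s
        if v not in parent:          # first token: the sender becomes the parent
            parent[v] = u
            queue.extend((v, w, 0) for w in adj[v] if w != u)
        if received[v] == len(adj[v]) and parent[v] is not None:
            queue.append((v, parent[v], size[v]))  # report the subtree size
    return size[root], messages


adj = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}
hosts = {"A": 10, "B": 10, "C": 10, "D": 10}
print(echo_count(adj, hosts, "A"))  # (40, 8)
```

In this example the graph has |E| = 4 edges and the Echo uses 2|E| = 8 messages, one per edge and direction. The inform phase, which is not simulated here, adds further messages along the tree edges.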



6 Simulation Results and Algorithm Enhancements 

6.1 The Case of Slow Mobile Hosts 

Both protocols were executed 100 times for each one of the topologies described
in Figure 2. The simulations confirmed the correctness of both algorithms in all
runs, i.e. the total number of Mobile Hosts was reported correctly by the
initiator host in all cases. As expected, the number of messages exchanged by the
two tier algorithm (tta) remained almost unchanged in all simulations, while in
the case of the virtual topology algorithm (vta) the number of messages
exchanged varied widely, depending on the random "virtual topology" of the
mobile hosts. Furthermore, the simulation results show a clear advantage of the
two tier algorithm over the virtual topology algorithm in both battery
consumption (resulting from the much smaller execution time and the fifty
percent fewer radio message transmissions) and fixed network load. In particular,
the two tier algorithm showed a marked decrease in total messages and battery
consumption in all cases.

The simulations lead to the conclusion that in the case of slow mobile hosts,
a simple two tier echo algorithm suffices to count the Mobile Hosts correctly and
efficiently. The simulation results are presented in Figure 4.

Fig. 4. Simulation results for slow mobile hosts (mean values over 100 experiments)



6.2 The Case of Fast Mobile Hosts: The Discovery of an Error 

As in the previous case, both protocols were executed 100 times for each one
of the topologies of Figure 2. The virtual topology algorithm produced correct
results, i.e. the total number of Mobile Hosts was computed correctly in all
simulations. The number of messages exchanged in the fixed network showed the
same behaviour as in the first case. On the other hand, the two tier algorithm,
despite being faster in all simulations, did not always compute the total number
of Mobile Hosts correctly. In fact, the calculated number was in
many cases slightly less than the total number of Mobile Hosts in the network.
This can also be observed by examining the total number of radio message 
transmissions which was lower than the number of Mobile Hosts, revealing that 
some Mobile Hosts did not participate in the execution. The results of some of 
these runs are shown in Figure 5. 

Fig. 5. Radio transmissions of the two tier protocol in the case of fast mobile
hosts

By using the DSP debugging facilities, the problem of the two tier algorithm
could be traced. The situation can be described as follows: Suppose that
MSSs S_1 and S_2 complete the execution of the MSS protocol steps 1 and 2 at
times t_1 and t_2 respectively, where t_2 > t_1. It is possible that a mobile host
starts moving from the cell of S_2 towards the cell of S_1 at time t'_1 < t_1 and
appears at the cell of S_1 at time t'_2, where t_1 < t'_2 < t_2. This may happen if
the path from the root to S_1 is shorter than the one to S_2. In this case the
mobile host does not participate in the algorithm even if it is willing to do so.
This situation does not arise in the case of slow Mobile Hosts because in practice
they do not move during this time interval.
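
The timing argument can be checked with a small Python sketch. It is a deliberately simplified model and not the DSP protocol: each MSS is assumed to broadcast < count > at a single instant, a host hears a broadcast only while it is inside the cell, and the data layout and function names are hypothetical. The second function models the remedy of rebroadcasting < count > whenever a host joins a cell whose MSS has already been notified, up to an assumed end time `t_end`.

```python
def counted_once(stays, notify):
    """One broadcast per MSS.

    stays:  list of (cell, enter, leave) intervals for one host
    notify: dict cell -> time at which its MSS broadcasts <count>
    """
    return any(enter <= notify[cell] < leave for cell, enter, leave in stays)


def counted_with_rebroadcast(stays, notify, t_end):
    """As above, but a notified MSS rebroadcasts <count> on every join until t_end."""
    return counted_once(stays, notify) or any(
        notify[cell] <= enter < t_end for cell, enter, _ in stays)


notify = {"S1": 2, "S2": 6}             # S1 is notified before S2
stays = [("S2", 0, 1), ("S1", 4, 100)]  # leaves S2 at time 1, reaches S1 at time 4
print(counted_once(stays, notify))                     # False
print(counted_with_rebroadcast(stays, notify, t_end=50))  # True
```

The host is in transit when S1 broadcasts and has already left when S2 broadcasts, so a single broadcast per MSS misses it.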

6.3 A Corrected Two Tier Algorithm 

The solution to the problem was to keep each MSS active (transmitting
< count > and receiving < count_me > messages) until all other MSSs have been
informed of the protocol execution (have received < count_tok > messages). In
order to eliminate the faulty behaviour, the protocol was modified in the
following way:

1. After a base station has been informed about the execution of a counting
protocol in the network (by receiving a < count_tok > message), it behaves
as follows: On receiving a < count_me > message it increases the number of
counted mobile hosts in its cell. If a base station receives a < join(mh-id) >
message (from a mobile host joining its cell) it broadcasts a < count > again.

2. After the completion of the first Echo, all of the MSSs have been informed
about the execution of a counting algorithm in the network and have
broadcast a < count > message in their cells.

3. The initiator base station starts a second execution of the Echo algorithm
by sending a < size_tok, 0 > message to its neighbours. This execution aims
to collect the size_p variables from all MSSs at the initiator. After completing
the execution of the second Echo (by receiving answers from all children and
sending its size_p variable to its parent as in step 2 of the previous version),
an MSS stops broadcasting < count > messages when a new mobile host
joins its cell.

4. In order to avoid the reappearance of the previous problem, the informing
phase is also implemented by using a double execution of the Echo. The
initiator base station broadcasts a < size, size_p > message in its cell and
then starts the third execution of the Echo by sending an < inform_tok,
size_p > message to its neighbours. Upon receiving such a message, an
MSS broadcasts a < size, size_p > message to its cell and forwards an
< inform_tok, size_p > message to its neighbours in order to continue the
execution. If a base station receives a < join(mh-id) > message, it
broadcasts the < size, size_p > again. After the third completion of the Echo, all
the MSSs have broadcast the size of the mobile network in their cells. The
initiator starts a fourth execution of the Echo to inform the MSSs about
the completion of the counting. After the completion of the fourth Echo an
MSS stops broadcasting < size > messages when a new mobile host joins
its cell.

Lemma 1. The modified two tier protocol counts the number of Mobile Hosts 
correctly. 

Proof: We consider an MSS as notified if it has received a < count_tok > message
and therefore collects answers and broadcasts < count > messages in its cell. If
a mobile host travels from an MSS S_2 which is not notified to an MSS S_1 which
is notified, S_1 still waits to collect answers (the execution of the first Echo has
not been completed yet since S_2 is not notified) and therefore the mobile host
will receive a < count > from S_1 and reply to it with a < count_me > message.
■

By the nature of the Echo algorithm, which is executed four times, the protocol
needs 8|E| = O(n^2) messages to be exchanged in the fixed network, yielding a total
cost of m·C_wireless + 8|E|·C_fixed, which is still approximately O(m)·C_wireless +
O(n^2)·C_fixed. The protocol is completed in time 8D.
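
To compare the two versions numerically, the following Python sketch evaluates the total-cost expressions m·C_wireless + 3|E|·C_fixed and m·C_wireless + 8|E|·C_fixed. The function name and the sample values are illustrative assumptions and are not taken from the simulations.

```python
def two_tier_cost(m, num_edges, c_wireless, c_fixed, corrected=False):
    """Total cost of the two tier protocol.

    m broadcasts by the mobile hosts plus 3|E| fixed-network messages for
    the original version, or 8|E| for the corrected one (four Echo runs).
    """
    fixed_messages = (8 if corrected else 3) * num_edges
    return m * c_wireless + fixed_messages * c_fixed


# Hypothetical network: 200 hosts, 30 links, wireless five times as costly.
print(two_tier_cost(200, 30, 5.0, 1.0))                  # 1090.0
print(two_tier_cost(200, 30, 5.0, 1.0, corrected=True))  # 1240.0
```

When m is much larger than n, the wireless term dominates in both versions, so the extra Echo executions change the total only moderately in such a setting.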

The simulation results for this version of the protocol are presented in Figure
6. The number of radio message transmissions verifies that the Mobile Hosts are
counted correctly, with a linear increase in the fixed network messages and the
execution time. Nevertheless, both performance parameters show a clear advantage
over the virtual topology algorithm, suggesting the application of the two tier
principle even in the case of fast Mobile Hosts.

Fig. 6. Simulation results of the corrected two tier protocol



7 Conclusions and Future Work 

In this work we studied how to use distributed algorithmic engineering paradigms
(such as the two tier principle), supported by the Distributed Systems Platform
and suitable experiments, in order to arrive at a new and more efficient Mobile
Host counting protocol. Our current work extends this methodology to the
problem of counting the Mobile Hosts in ad-hoc networks, where there are no
MSSs.



Acknowledgments 

We wish to thank Richard Tan for inspiring discussions about the issue. 







References

1. "Impact of mobility on Distributed Computations", A. Acharya, B. R. Badrinath,
T. Imielinski, Operating Systems Review, April 1993.

2. "Structuring distributed algorithms for Mobile Hosts", A. Acharya, B. R.
Badrinath, T. Imielinski, 14th International Conference on Distributed Computing
Systems, June 1994.

3. "Concurrent online tracking of mobile users", B. Awerbuch, D. Peleg.

4. "IP-based protocols for mobile internetworking", D. Duchamp, G. Q. Maguire,
J. Ioannidis, In Proc. of ACM SIGCOMM, September 1991.

5. "Brief Announcement: Fundamental Distributed Protocols in Mobile Networks",
K. Hatzis, G. Pentaris, P. Spirakis, V. Tampakas, R. Tan, Eighteenth ACM
Symposium on Principles of Distributed Computing (PODC '99), Atlanta, GA, USA.

6. "Fundamental Control Algorithms in Mobile Networks", K. Hatzis, G. Pentaris,
P. Spirakis, V. Tampakas, R. Tan, Eleventh ACM Symposium on Parallel Algorithms
and Architectures (SPAA 99), June 27-30 1999, Saint-Malo, France.

7. "The description of a distributed algorithm under the DSP tool: The BNF
notation", ALCOM-IT Technical Report, 1996.

8. "The design of the DSP tool", ALCOM-IT Technical Report, 1996.

9. "DSP: Programming manual", 1998.

10. "The specifications of the DSP tool", ALCOM-IT Technical Report, 1996.

11. "Mobile Computing", T. Imielinski, H. F. Korth, Kluwer Academic Publishers,
1996.

12. "Distributed Systems towards a formal approach", G. Le Lann, IFIP Congress,
1977.

13. "Distributed algorithms", N. Lynch, Morgan Kaufmann Publishers, 1996.

14. "Introduction to distributed algorithms", G. Tel, Cambridge University Press,
1994.

15. "Distributed Dynamic Channel Allocation for Mobile Computing", R. Prakash,
N. Shivaratri, M. Singhal, In Proc. of ACM PODC 1995.

16. "Competitive Call Control in Mobile Networks", G. Pantziou, G. Pentaris,
P. Spirakis, 8th International Symposium on Algorithms and Computation, ISAAC 97,
December 1997.

17. "Symmetry Breaking in Distributive Networks", A. Itai and M. Rodeh, In
Proceedings of 22nd FOCS, 1981, pp. 150-158.

18. "Approximating the Size of a Dynamically Growing Asynchronous Distributed
Network", B. Awerbuch and S. A. Plotkin, Technical Report MIT/LCS/TM-328,
April 1987.

19. "Implementation and testing of eavesdropper protocols using the DSP tool",
K. Hatzis, G. Pentaris, P. Spirakis and V. Tampakas, In Workshop on
Algorithmic Engineering WAE 98, Saarbruecken, Aug. 1998. PostScript version in
http://helios.cti.gr/alcom-it/foundation

20. "Agent-Mediated Message Passing for Constrained Environments", A. Athas and
D. Duchamp, In USENIX Symposium on Mobile and Location Independent Computing,
Aug. 1993.

21. "Unix for Nomads: Making Unix support mobile computing", M. Bender et al., In
USENIX Symposium on Mobile and Location Independent Computing, Aug. 1993.

22. "Power efficient filtering on data on the air", T. Imielinski et al., In EDBT'94,
1994.

23. "System Issues in Mobile Computing", B. Marsh, F. Douglis, and R. Caceres, T.R.
MITL-TR-50-93, MITL, 1993.

24. DSP Web Site, http://helios.cti.gr/alcom-it/dsp




108 K. Hatzis, G. Pentaris, P. Spirakis, B. Tampakas 



Appendix: The Implementation of the Two-Tier Protocol Under the DSP Environment 

Protocol specification 

The implementation is based on the protocol description language of the DSP. 
This language provides the ability to describe a protocol in an algorithmic 
form, similar to the one found in the literature on distributed systems. It 
includes the usual statements found in programming languages like C and Pascal, 
as well as special structures for the description of distributed protocols 
(e.g. states, timers) and communication primitives (e.g. shared variables, 
messages). The statements of the language support the use of these structures 
during the specification of a distributed protocol in the areas of data 
modelling, communication, queuing, and process and resource management (e.g. 
send a message through a link, execute a function as an effect of a process 
state transition). Some of the language statements support the interface of 
the specified protocol with the DSP graphical environment (e.g. show a node or 
a link in a different color). The BNF notation of this language is presented 
in [7] and a more detailed description is given in the user manual ([9]). 

The basic object of the DSP protocol de- 
scription language is the process. The processes 
can be static (residing permanently on a node) 
or mobile (moving throughout the network). Our 
protocol involves two types of processes, a static 
(the Mobile Service Station) and a mobile one 
(the Mobile Host). Each one of them is described 
separately in the protocol file as a distinct object. 

The protocol starts with the TITLE definition. It is followed by the definition 
of the messages used, which are count, count_me, count_tok, size and 
inform_tok. The Mobile_Host is defined to be a mobile process. The INIT 
procedure is executed during the initialization of the process and includes the 
initialization of the process variables by assigning them constant values, or 
values created by specific language statements. The Mobile_Host initializes its 
counted flag to FALSE and stores its identity in local memory. The procedure 
PROTOCOL includes the core algorithm executed by the process. It is an 
event-driven procedure, which means that for every specific event that the 
process is expected to handle, a corresponding set of statements is executed. 
In this case the Mobile_Host process handles the following events: 

- INITIALIZE, which indicates the awakening of the process. Only one mobile 
  host will process this event (the initiator) and will transmit a count 
  message. 

- MOBILE_RECEIVE_RADIO_MESSAGE, which means that the process receives a 
  message broadcast by an MSS. The logical structure of the statements executed 
  in this case is similar to the one presented in Figure 

- MOBILE_PROCESS_ARRIVAL, meaning that a mobile process arrived on a node. If 
  this event concerns the mobile process which currently executes the protocol, 
  it starts an internal timer timer_to_stay to remain on its current node for 
  stay_period time units. 

- TIME_OUT, which indicates the timeout of the internal timer. In this case 
  the process moves to a (randomly chosen) neighbour. In the case of slow 
  mobile hosts the timeout period was set to 2D, while in the case of fast 
  mobile hosts this period was set to one time unit. 

The Mobile_Service_Station process is a static process and it is implemented 
in a similar way. The interested reader may find the correspondence between the 
DSP specification and the description of a similar algorithm presented 
in (page 190). 

A STATIC_RECEIVE_RADIO_MESSAGE event concerns a radio message which has been 
transmitted by a Mobile_Host in the cell, while a RECEIVE_MESSAGE event 
indicates a message received from the fixed network. The initiator MSS calls 
the procedure wake_up() in order to start the execution of the Echo. In the 
second phase of the execution (while size messages are propagating from the 
leaves towards the root) each MSS identifies the set of its children, which is 
then used for the propagation of the inform_tok messages. 
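The event-driven style described above can be illustrated outside DSP. The following Python sketch is not DSP code and is not taken from the tool; the class name MobileHost, the handler names and the outbox list are hypothetical stand-ins that only mirror how the Mobile_Host reacts to the INITIALIZE and MOBILE_RECEIVE_RADIO_MESSAGE events.

```python
from dataclasses import dataclass, field


@dataclass
class MobileHost:
    """Hypothetical stand-in for the Mobile_Host process (illustration only)."""
    my_id: int
    counted: bool = False          # mirrors the 'counted' flag set in INIT
    size: int = 0                  # last network size learned from an MSS
    outbox: list = field(default_factory=list)  # messages this host has transmitted

    def on_initialize(self):
        # Only the initiator handles this event: it broadcasts a count message.
        self.outbox.append(("count", None))

    def on_radio_message(self, msg_type, data=None):
        # A host answers the first count message it hears with count_me,
        # and remembers that it has been counted.
        if msg_type == "count" and not self.counted:
            self.outbox.append(("count_me", None))
            self.counted = True
        # A size message carries the final result computed by the MSSs.
        if msg_type == "size":
            self.size = data
```

A host that hears count twice replies only once, which is the property the counted flag is meant to guarantee.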

TITLE "A counting protocol for mobile networks with fixed base stations"; 

MESSAGE count END; 
MESSAGE count_me END; 
MESSAGE count_tok END; 
MESSAGE size 
    INT sizep; END; 
MESSAGE inform_tok 
    INT info; END; 

MOBILE PROCESS Mobile_Host 
BEGIN 
    STATE dummy; 
    TIMER timer_to_stay; 
    CONST 
        stay_period=100; 
    VAR 
        INT size, my_id, arr_id, neigh; 
        BOOLEAN counted; 

    PROCEDURE wake_up(); 
    BEGIN 
        TRANSMIT_NEW_MESSAGE count 
        BEGIN END; 
    END; 

    PROCEDURE move(); 
    BEGIN 
        PUT_RANDOM_NEIGHBOUR TO neigh; 
        MOVE_TO_NODE neigh; 
    END; 

    INIT(); 
    BEGIN 
        size=0; 
        PUT_MY_PROCESS_ID TO my_id; 
        counted=FALSE; 
    END; 

    PROTOCOL(); 
    BEGIN 
        ON EVENT INITIALIZE DO CALL wake_up(); 

        ON EVENT MOBILE_RECEIVE_RADIO_MESSAGE DO 
        BEGIN 
            IF (CURRENT_MESSAGE_TYPE==count) THEN 
                IF (counted==FALSE) THEN 
                BEGIN 
                    TRANSMIT_NEW_MESSAGE count_me 
                    BEGIN END; 
                    counted=TRUE; 
                END; 
            IF (CURRENT_MESSAGE_TYPE==size) THEN 
                size=CURRENT_MESSAGE_DATA.sizep; 
        END; 

        ON EVENT MOBILE_PROCESS_ARRIVAL DO 
        BEGIN 
            PUT_ID_OF_ARRIVING_PROCESS TO arr_id; 
            IF (arr_id==my_id) THEN 
                START timer_to_stay 
                TIMEOUT stay_period; 
        END; 

        ON EVENT TIME_OUT DO CALL move(); 
    END; 
END; 

STATIC PROCESS Mobile_Service_Station 
BEGIN 
    STATE dummy; 
    CONST 
        undefined=-1; 
    VAR 
        INT receivep, fatherp, sizep, q, k, 
            messages_send; 
        SET neighbours, children; 

    PROCEDURE wake_up(); 
    BEGIN 
        SEND_NEW_MESSAGE count_tok TO 
            ALL NEIGHBOURS 
        BEGIN END; 
    END; 

    INIT(); 
    BEGIN 
        receivep=0; fatherp=undefined; sizep=0; 
        messages_send=0; 
        PUT_ALL_NEIGHBOURS TO neighbours; 
    END; 

    PROTOCOL(); 
    BEGIN 
        ON EVENT STATIC_RECEIVE_RADIO_MESSAGE DO 
        BEGIN 
            IF (CURRENT_MESSAGE_TYPE==count) THEN 
                CALL wake_up(); 
            IF (CURRENT_MESSAGE_TYPE==count_me) THEN 
                sizep=sizep+1; 
        END; 

        ON EVENT RECEIVE_MESSAGE DO 
        BEGIN 
            IF (CURRENT_MESSAGE_TYPE==count_tok) THEN 
            BEGIN 
                PUT_SENDER_OF_CURRENT_MESSAGE TO q; 
                fatherp=q; 
                DELETE q FROM neighbours; 
                FOR ALL k IN neighbours DO 
                BEGIN 
                    SEND_NEW_MESSAGE count_tok TO k 
                    BEGIN END; 
                    messages_send=messages_send+1; 
                END; 
                IF (messages_send==0) THEN 
                    SEND_NEW_MESSAGE size TO fatherp 
                    BEGIN size=sizep; END; 
            END; 

            IF (CURRENT_MESSAGE_TYPE==size) THEN 
            BEGIN 
                PUT_SENDER_OF_CURRENT_MESSAGE TO q; 
                INSERT q IN children; 
                receivep=receivep+1; 
                sizep=sizep+CURRENT_MESSAGE_DATA.size; 
                IF (receivep==messages_send) THEN 
                BEGIN 
                    IF (fatherp!=undefined) THEN 
                        SEND_NEW_MESSAGE size TO fatherp 
                        BEGIN size=sizep; END; 
                    ELSE 
                    BEGIN 
                        TRANSMIT_NEW_MESSAGE size 
                        BEGIN size=sizep; END; 
                        FOR ALL k IN children DO 
                            SEND_NEW_MESSAGE inform_tok TO k 
                            BEGIN 
                                info=sizep; 
                            END; 
                    END; 
                END; 
            END; 

            IF (CURRENT_MESSAGE_TYPE==inform_tok) THEN 
                TRANSMIT_NEW_MESSAGE size 
                BEGIN size=CURRENT_MESSAGE_DATA.info; 
                END; 
        END; 
    END; 
END; 

Protocol initialization 

The initialization language of DSP provides statements that assign process 
types (as specified in the protocol) to nodes of the topology and create and 
place mobile processes on nodes. An instance of an initialization file used 
during a simulation is presented below. It is used to assign the 
Mobile_Service_Station process type to all nodes and create five mobile 
processes of type Mobile_Host on each node of a network of 20 nodes. In this 
example the Mobile_Host 0 is the initiator and is initiated at an arbitrary 
time instance between 5 and 20. 



INITIALIZATION FILE "init_file1" FOR 
PROTOCOL "Counting protocol based on the two tier principle" 

SET ALL NODES TO Mobile_Service_Station 

PUT MOBILE PROCESS Mobile_Host ON NODE 0-20 
PUT MOBILE PROCESS Mobile_Host ON NODE 0-20 
PUT MOBILE PROCESS Mobile_Host ON NODE 0-20 
PUT MOBILE PROCESS Mobile_Host ON NODE 0-20 
PUT MOBILE PROCESS Mobile_Host ON NODE 0-20 
INIT MOBILE PROCESS 0 RANDOMLY FROM 5 TO 20 



Abstract. Traffic information systems are among the most prominent 
real-world applications of Dijkstra’s algorithm for shortest paths. We 
consider the scenario of a central information server in the realm of public 
railroad transport on wide-area networks. Such a system has to process 
a large number of on-line queries in real time. In practice, this problem 
is usually solved by heuristic variations of Dijkstra’s algorithm, which 
do not guarantee optimality. We report results from a pilot study, in 
which we focused on the travel time as the only optimization criterion. In 
this study, various optimality-preserving speed-up techniques for 
Dijkstra’s algorithm were analyzed empirically. This analysis was based on 
the timetable data of all German trains and on a “snapshot” of half a 
million customer queries.¹ 



1 Introduction 

Problem. From a theoretical viewpoint, the problem of finding a shortest path 
from one node to another one in a graph with edge lengths is satisfactorily 
solved. In fact, the Fibonacci-heap implementation of Dijkstra’s algorithm 
requires O(m + n log n) time, where n is the number of nodes and m the number 
of edges. However, various practical application scenarios impose restrictions 
that make this algorithm impractical. For instance, many scenarios impose a 
strict limitation on space consumption.² 

In this paper, we consider a different scenario: space consumption is not an 
issue, but the system has to answer a potentially infinite number of customer 
queries on-line. The real-time restrictions are soft, which basically means that 
the average response time is more important than the maximum response time. 
The concrete scenario we have in mind is a central server for public railroad 
transport, which has to process a large number of queries (e.g. a server that 
is directly accessible by customers through terminals in the train stations or 
through a WWW interface). 

¹ With special courtesy of the TLC Transport-, Informatik- und Logistik-Consulting 
GmbH/EVA-Fahrplanzentrum, a subsidiary of the Deutsche Bahn AG. 

² To give a concrete example: if a traffic information system is to be distributed on 
CD-ROM or to be run on an embedded system, a naive implementation of Dijkstra’s 
algorithm would typically exceed the available space. 

J.S. Vitter and C.D. Zaroliagis (Eds.): WAE’99, LNCS 1668, pp. 1999- 

Algorithmic problems of this kind are usually approached heuristically in 
practice, because the average response time of optimal algorithms seems to be 
unacceptable. In a new long-term project, we investigate the question to what 
extent optimality-preserving variants of Dijkstra’s algorithm have become com- 
petitive on contemporary computer technology. Here we give an experience re- 
port from a pilot study, in which we focused on the most fundamental kind 
of queries: find the fastest connection from some station A to some station B 
subject to a given earliest departure time. 

This scenario is an example of a general problem in the design of practical 
algorithms, which we discussed in earlier work: computational studies based on artificial 
(e.g. random) data do not make much sense, because the characteristics of the 
real-world data are crucial for the success or failure of an algorithmic approach in 
a concrete use scenario. Hence, experiments on real-world data are the method 
of choice. 

Related work. Various textbooks address speed-up techniques for Dijkstra’s al- 
gorithm but have no concrete applications in mind, notably ^ and mj. In m . 
Chapter 4, a brief, introductory survey of selected techniques is given (with a 
strong bias towards the use scenario discussed here). Most work from the scien- 
tific side addresses the single-source variant, where a spanning tree of shortest 
paths from a designated root to all other nodes is to be found. Moreover, the 
main aspect addressed in work like is the choice of the data structure for the 
priority queue. In Section 2 below (paragraph on the “search horizon”), we will 
see that the scenario considered in this paper requires algorithmic approaches 
that are fundamentally different, and Section 3 will show that the choice of the 
priority queue is a marginal aspect here. 

On the other hand, most application-oriented work in this field is commercial, 
not scientific, and there is only a small number of publications. In fact, we are 
not aware of any publication especially about algorithms for wide-area railroad 
traffic information systems. 

Some scientific work has been done on local public transport. For example, 
[3] gives some insights into the state of the art. However, local public transport 
is quite different from wide-area public transport, because the timetables are 
very regular, and the most powerful speed-up techniques are based on the strict 
periodicity of the trains, buses, ferries, etc. In contrast, our experience is that 
the timetables of the national European train companies are not regular enough 
to gain a significant profit from these techniques.³ 

On the other hand, private transport has been extensively investigated in 
view of wide-area networks. Roughly speaking, this means “routing planners” 
for cars on city and country maps. This problem is different from 
ours in that it is two-dimensional, whereas train timetables induce the time as 
a third dimension: due to the lack of periodicity, the earliest departure time is 
significant in our scenario. In contrast, temporal aspects do not play any role in 
the work quoted above.⁴ So it is not surprising that the research has focused on 
purely geometric techniques. 

³ This experience is supported by personal communications with people from the 
industry. 

For completeness, we mention the work on variants of Dijkstra’s algorithm 
that are intended to efficiently cope with large data in secondary memory. Chap- 
ter 9 of n gives an introduction to theoretical and practical aspects. As men- 
tioned above, the problems caused by the slow access to secondary memory are 
beyond the scope of our paper. 

Contribution of the paper. We implemented and tested various optimality-pre- 
serving speed-up techniques for Dijkstra’s algorithm. The study is based on 
all train data (winter period 1996/97) of the Deutsche Bahn AG, the national 
railroad and train company of Germany. The processed queries are a “snapshot” 
of the central Hafas⁵ server of the Deutsche Bahn AG, in which all queries of 
customers were recorded over several hours. The result of this snapshot comprises 
more than half a million queries, which might suffice for a representative analysis 
(assuming that the typical query profile of customers does not vary dramatically 
from day to day). 

Due to the above-mentioned insight that the periodicity of the timetables 
is not a promising base for algorithmic approaches, the question is particularly 
interesting whether geometric techniques like those in routing planners are suc- 
cessful, although the scenario has geometric and temporal characteristics. We 
will see that this question can indeed be answered in the affirmative. 

2 Algorithms 

Train graph. The arrival or departure of a train at a station will be called an 
event. The train graph contains one node for every event. Two events v and 
w are connected by a directed edge v → w if v represents the departure of 
a train at some station and w represents the very next arrival of this train at 
some other station. On the other hand, two successive events at the same station 
are connected by an edge (in positive time direction), which means that every 
station is represented in the graph by a cycle through all of its events (the cycle 
is closed by a turn-around edge at midnight). 

In each case, the length of an edge is the time difference of the events repre- 
sented by its endnodes. Obviously, a query then amounts to finding a shortest 
path from the earliest event at the start station not before the earliest departure 
time to an arbitrary arrival event at the destination. 

The data contains 933,066 events on 6,961 stations. Consequently, there are 
933,066 · 3/2 = 1,399,599 edges in the graph. 
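The factor 3/2 can be read off the construction: every event has exactly one outgoing station edge on its station cycle, and every departure event (half of all events, assuming each departure is paired with exactly one arrival) has one train edge. The following Python sketch builds such a graph from a toy timetable. It is only an illustration under these assumptions; the function name build_train_graph and the input format are hypothetical, not the data format used in the study.

```python
def build_train_graph(rides, day=1440):
    """Build a toy train graph.

    rides: list of (dep_station, dep_time, arr_station, arr_time), times in
    minutes after midnight. Returns (events, edges), where events[i] is
    (station, time) and each edge is (from_event, to_event, length).
    """
    events = []
    edges = []
    # One departure event and one arrival event per ride, joined by a train edge.
    for dep_s, dep_t, arr_s, arr_t in rides:
        d = len(events)
        events.append((dep_s, dep_t))
        a = len(events)
        events.append((arr_s, arr_t))
        edges.append((d, a, (arr_t - dep_t) % day))
    # Group the events by station and connect them in a cycle ordered by time;
    # the last edge of each cycle is the turn-around edge at midnight.
    by_station = {}
    for idx, (station, _) in enumerate(events):
        by_station.setdefault(station, []).append(idx)
    for idxs in by_station.values():
        idxs.sort(key=lambda i: events[i][1])
        for k, i in enumerate(idxs):
            j = idxs[(k + 1) % len(idxs)]
            # A station with a single event gets a full-day self loop.
            length = day if i == j else (events[j][1] - events[i][1]) % day
            edges.append((i, j, length))
    return events, edges
```

For three rides A to B, B to C and C to A this yields 6 events and 9 edges, matching the ratio 3/2.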

⁴ In principle, temporal aspects would also be relevant for the private transport, for 
example, the distinction between “rush hours” and other times of the day. 

⁵ Hafas is a trademark of the Hacon Ingenieurgesellschaft mbH, Hannover, Germany. 

Fig. 1. The frequency distribution histogram of the queries from the “snapshot” 
according to the Euclidean distance between the start station and the destination 
(granularity: 15 kilometers). 



Priority queue. Dijkstra’s algorithm relies on a priority queue, which manages 
the nodes on the current “frontier line” of the search. As mentioned above, the 
best general worst-case bound, O(m + n log n), is obtained from Fibonacci heaps, 
where n is again the number of nodes and m the number of edges. We do not use 
a Fibonacci heap but a normal heap (also often called a 2-heap), which yields 
an O((n + m) log n) bound. Since m ∈ O(n) in train graphs, both bounds 
reduce to O(n log n). 

As an alternative to heaps, we also implemented the dial variant as described 
in Basically, this means that the priority queue is realized by an array of 
buckets with a cyclically moving array index. The nodes of the frontier line are 
distributed among the buckets, and it is guaranteed that the very next non- 
empty bucket after the current array index always contains the candidates to be 
processed next. 
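The bucket idea can be sketched in a few lines of Python. This is a generic textbook-style dial queue under the assumption of nonnegative integer edge lengths bounded by max_len; it is not the implementation evaluated in the study, and the function name dial_dijkstra is hypothetical.

```python
def dial_dijkstra(adj, source, max_len):
    """Shortest distances from source using a cyclic array of buckets.

    adj: dict mapping a node to a list of (neighbor, length) pairs, where
    every length is an integer with 0 <= length <= max_len.
    """
    nbuckets = max_len + 1          # all open labels lie in [cur, cur + max_len]
    buckets = [set() for _ in range(nbuckets)]
    dist = {source: 0}
    buckets[0].add(source)
    pending = 1                     # number of nodes currently stored in buckets
    cur = 0                         # current distance; bucket index is cur % nbuckets
    while pending:
        bucket = buckets[cur % nbuckets]
        while bucket:
            v = bucket.pop()
            pending -= 1
            for w, length in adj.get(v, ()):
                nd = cur + length
                old = dist.get(w)
                if old is None or nd < old:
                    if old is not None:
                        # w is still waiting in the bucket of its old label.
                        buckets[old % nbuckets].discard(w)
                        pending -= 1
                    dist[w] = nd
                    buckets[nd % nbuckets].add(w)
                    pending += 1
        cur += 1                    # cyclic increment of the moving array index
    return dist
```

The increments of cur correspond to the cyclic increments of the moving array index that are used as the operation count for the dial variant in Section 3.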

Search horizon. Of course, we deviate from the “textbook version” of Dijkstra’s 
algorithm in that we do not compute the distance of every node from the start 
node but terminate the algorithm immediately once the first (and thus opti- 
mal) event at the destination is processed. The most fundamental optimality- 
preserving speed-up technique for our scenario is then a reduction of the search 
to a (hopefully) small part of the graph, which contains all relevant events. 
Figs. 1 and 2 demonstrate that such a reduction is crucial. In fact, they reveal that 
for the majority of all queries only small fractions of the total area and time 
horizon are relevant. 

Fig. 2. Like Fig. 1 except that the abscissa now denotes the minimal travel time 
in minutes (granularity: 20 minutes). 

To our knowledge, some commercial implementations remove nodes and edges 
from the graph (more or less heuristically, i.e. losing optimality) before the 
search itself takes place. In contrast, we aim at an evaluation of 
optimality-preserving strategies, so our approach is quite different. First of all, we apply 
the amortization technique discussed in to obtain a sublinear expected run 
time per query. The only obstacle to sublinearity is the initialization of all nodes 
with infinite distance labels in the beginning of the textbook algorithm; in fact, 
Figs. 1 and 2 strongly suggest that on average the main loop of the algorithm 
only processes a very small fraction of the graph until the destination is seen 
and the algorithm terminates. 

As described in every node is given an additional time stamp, which 
stores the number of the query in which it was reached in the main loop. When- 
ever a node is reached, its time stamp is updated accordingly. If this update 
properly increases the time stamp, the distance label is regarded as infinite, 
otherwise the value of the distance label is taken as is. Consequently, there is 
no need for an expensive initialization phase, and no event outside the “search 
horizon” of the main loop is hit at all. 
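The time-stamp idea, together with the early termination at the destination, can be sketched as follows. This Python version is an illustration only: it uses the standard heapq module with lazy deletion of stale heap entries instead of a decrease-key operation, and the class name TimestampedDijkstra and the is_target callback are hypothetical.

```python
import heapq


class TimestampedDijkstra:
    """Answers many queries on one graph without re-initializing all labels."""

    def __init__(self, adj, num_nodes):
        self.adj = adj                    # adj[v] = list of (w, length)
        self.dist = [0] * num_nodes       # distance labels, valid only if stamped
        self.stamp = [0] * num_nodes      # number of the query that last set the label
        self.query_no = 0

    def _label(self, v):
        # A label written by an older query is regarded as infinite.
        return self.dist[v] if self.stamp[v] == self.query_no else float("inf")

    def _set_label(self, v, d):
        self.dist[v] = d
        self.stamp[v] = self.query_no

    def query(self, source, is_target):
        """Return the distance to the first target node processed, or None."""
        self.query_no += 1
        self._set_label(source, 0)
        heap = [(0, source)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > self._label(v):
                continue                  # stale heap entry, label was improved later
            if is_target(v):
                return d                  # first target processed is optimal
            for w, length in self.adj[v]:
                nd = d + length
                if nd < self._label(w):
                    self._set_label(w, nd)
                    heapq.heappush(heap, (nd, w))
        return None
```

Each query touches only the nodes inside its own search horizon; the two arrays are allocated once when the object is created.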

The following two additional techniques rely on this general outline. They 
are independent of each other in the sense that one of them may be applied 
alone, or both of them may be applied simultaneously. 

Angle restriction. This technique additionally relies on the coordinates 
associated with the individual stations. In a preprocessing step, we apply Dijkstra’s 
algorithm to each event to compute shortest paths from this event to all other 
stations.⁶ The results are not stored (this would require too much space) but 
only used to compute two values α and β for each edge. These 2 · m values are 
stored and then used in the on-line system. More specifically, these values are 
to be interpreted as angles in the plane. Let v → w be an edge, and let s be the 
station of event v. Then the values α and β stored for this edge span a circle 
sector with center s. The meaning is this: if the shortest path from event v to some 
station s' contains v → w, then s' is in this circle sector. Clearly, the brute-force 
application of Dijkstra’s algorithm in the preprocessing allows one to compute 
the narrowest possible circle sector for each edge subject to this constraint. 

⁶ This preprocessing takes several hours, which is absolutely acceptable in practice. 
Hence, there is no need to optimize the preprocessing. 

Fig. 3. For each minimal travel time (granularity: 20 minutes) the total CPU 
times of all queries yielding this value of the travel time are summed up to reveal 
which range of travel times contributes most of the total CPU time. 

Consequently, edge v → w may be ignored by the search if the destination is 
not in the circle sector of this edge. The restriction of the search to edges whose 
circle sectors contain the destination is the strategy that we will call “angle 
restriction” in the following. 
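The test applied to each edge during the search can be sketched as follows. The sketch assumes planar station coordinates and a sector that runs counter-clockwise from α to β, both given in radians; this representation and the function names are assumptions made for illustration, since the text does not fix how the two angles are encoded.

```python
import math


def bearing(origin, point):
    """Angle of the direction from origin to point, in [0, 2*pi)."""
    return math.atan2(point[1] - origin[1], point[0] - origin[0]) % (2 * math.pi)


def in_sector(alpha, beta, theta):
    """True if theta lies in the counter-clockwise sector from alpha to beta."""
    two_pi = 2 * math.pi
    return (theta - alpha) % two_pi <= (beta - alpha) % two_pi


def edge_may_be_used(station_xy, dest_xy, alpha, beta):
    """Angle restriction: keep an edge only if the destination is in its sector.

    station_xy is the position of the station s of the tail event v,
    dest_xy the position of the destination station.
    """
    return in_sector(alpha, beta, bearing(station_xy, dest_xy))
```

Working modulo 2π keeps the test correct for sectors that wrap around the zero direction. With this encoding, α = β denotes a single direction, so a full-circle sector would need a separate flag.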



Selection of stations. The basic idea behind this technique is similarly 
implemented in various routing planners for the private transport. A certain 
small set of nodes is selected. For two selected nodes v and w, there is an edge 
v → w if, and only if, there is a path from v to w in the train graph such that 
no internal node of this path belongs to the selected ones. In other words, every 
connected component of non-selected stations (plus the neighboring selected 
stations) is replaced by a directed graph defined on the neighboring selected 
stations. This constitutes an additional, auxiliary graph. 

The length ℓ(v, w) of an edge v → w in this auxiliary graph is defined as 
the minimum length of a path from v to w in the train graph that contains no 
selected nodes besides v and w. The auxiliary graph and these edge weights are 
also constructed once and for all in a preprocessing step (which only takes a few 
minutes). Each query is then answered by the computation of a shortest path 
in the auxiliary graph, and this path corresponds to a shortest path in the train 
graph. 

Fig. 4. The relation of the number of edges hit without (abscissa) and with 
(ordinate) strategy “angle restriction” (granularity: 500 edges). 



We implement this general approach as follows. First of all, note that it is 
not necessary to reconstruct the path in the train graph that corresponds to 
the shortest path computed in the auxiliary graph. In fact, what we really want 
to have from the computation is a sequence of trains and the stations where to 
change train. This data can be attached to the edges of the auxiliary graph in an 
even more compact (less redundant) and thus more efficiently evaluable fashion 
than to the edges of the train graph. 

We select a set of stations, and the events of these stations are the selected 
nodes. Clearly, there is a trade-off. Roughly speaking, the smaller the number of 
selected stations is, the larger the resulting connected components are and, even 
worse, the larger the number of selected stations neighboring to a component. 
Since the number of edges depends on the latter number in a quadratic fashion, 
an improvement of performance due to a rigorous reduction of stations is soon 
outweighed by the tremendous increase in the number of edges. 

It has turned out that in our setting, a minor refinement of this strategy is 
necessary and sufficient to overcome this trade-off. For this, let u, v, and w be 
three selected events such that edges u → v, v → w, and u → w exist in the 
auxiliary graph. If ℓ(u,v) + ℓ(v,w) ≤ ℓ(u,w), then edge u → w is dropped in 
the auxiliary graph. Again, optimality is preserved. The number of edges grows 
only moderately after this modification, so a quite small set of selected stations 
becomes feasible. 
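The refinement can be sketched as a filter over the edge set of the auxiliary graph. The Python function below is an illustration under two assumptions that are not stated in the text: all edge lengths are strictly positive, and every comparison is made against the original, unpruned edge set. The name prune_dominated_edges and the dictionary representation are hypothetical.

```python
def prune_dominated_edges(length):
    """Drop every edge u -> w that is matched or beaten by a two-edge detour.

    length: dict mapping (u, w) to the positive length of edge u -> w.
    Returns a new dict containing only the edges that are kept.
    """
    # Successor sets of the original graph, used to enumerate candidates v.
    succ = {}
    for (u, v) in length:
        succ.setdefault(u, set()).add(v)
    kept = dict(length)
    for (u, w), l_uw in length.items():
        for v in succ.get(u, ()):
            if v == u or v == w or (v, w) not in length:
                continue
            if length[(u, v)] + length[(v, w)] <= l_uw:
                del kept[(u, w)]      # the detour via v is at least as good
                break
    return kept
```

With positive lengths, a dropped edge is always replaced by two strictly shorter edges, so repeating the argument on those edges terminates and the shortest-path distances in the pruned graph stay the same.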

In the data available to us, every station is assigned an “importance number,” 
which is intended to rank its degree of “centrality” in the railroad network. The 
computational study is based on the 225 stations in the highest categories (see 
Fig. 8). These stations induce 95,423 events in total, which means that the 
number of events is approximately reduced by a factor of 10, and the number of 
stations is reduced by a factor of 31. The discrepancy between these two factors 
is not surprising, because central stations are typically served by more trains than 
marginal stations. 

Combination of both strategies. In principle, these two strategies can be com- 
bined in two ways, namely the angle restrictions can be computed for the auxil- 
iary graph, or they can be computed for the train graph and simply taken over 
for the auxiliary graph. Not surprisingly, we will see in the next section that the 
former strategy outperforms the latter one. 

3 Analysis of the Algorithmic Performance 

The experiments were performed on a SUN Sparc Enterprise 5000, and the code 
was written in C++ using the GNU compiler. 

Table 1 presents a summarizing comparison of all combinations of strategies. 
Note that the total number of algorithmic steps is asymptotically dominated by 
the number of operations inside the priority queue. In other words, these opera- 
tions are representative operation counts in the sense of [J, Sect. 18. More specif- 
ically, for a heap the number of exchange operations is representative, whereas 
for the dial variant the number of cyclic increments of the moving array index 
is representative. The average total numbers per query of these operations are 
listed in the last part of Table 1. 

For both implementations of the priority queue, the CPU times imply the 
same strong ranking of strategies. Figs. 7 and 8 might give a visual impression 
why this ranking is so unambiguous. Moreover, the discrepancy between the 
heap and the dial implementation also decreases roughly from row to row. This 
is not surprising: the overhead of the heap should be positively correlated with 
the size of the heap, which is significantly reduced by both strategies. 

Our experience with several versions of the code is that the exact CPU times 
are strongly sensitive to the details of the implementation, but the general ten- 
dency is maintained and seems to be reliable. In particular, the main question 
raised in this paper (whether optimality-preserving techniques are competitive) 
can be safely answered in the affirmative at least for the restriction of the prob- 
lem to the total travel time as the only optimization criterion. 

However, a detailed look at the results is more insightful. Fig. 4 shows 
that there is a very strong linear correlation between the number of edges hit 
with/without strategy “angle restriction.” On the other hand, Fig. 5 shows a 
detailed analysis of one particular, exemplary combination of strategies. A 
comparison of these diagrams reveals an interesting effect, which is also found in the 
analogous diagrams of the other combinations: the CPU times of both the heap 
and the dial implementation are linear in the number of nodes hit by the search. 
This correlation is so strong and the variance is so small that corresponding 
diagrams in Fig. 5 look almost identical. Figure Q reveals the cause. 

In other words, in both cases the operations on the priority queue take 
constant time on average, even when the average is taken over each query separately! 
This is in great contrast to the asymptotic worst-case complexity of these data 
structures. 

Selection of stations   Angle restriction   CPU heap   CPU dial 
no                      no                  0.310      0.103 
no                      yes                 0.036      0.018 
yes                     no                  0.027      0.012 
yes                     train               0.007      0.005 
yes                     auxiliary           0.005      0.003 

Selection of stations   Angle restriction   Nodes      Edges 
no                      no                  17576      31820 
no                      yes                 4744       10026 
yes                     no                  2114       3684 
yes                     train               1140       2737 
yes                     auxiliary           993        2378 

Selection of stations   Angle restriction   Ops. heap  Ops. dial 
no                      no                  246255     23866 
no                      yes                 24526      3334 
yes                     no                  26304      3660 
yes                     train               4973       1197 
yes                     auxiliary           3191       932 

Table 1. A summary of all computational results for the individual 
combinations of techniques. The entries “train” and “auxiliary” in column #2 refer to 
the graph in which the angles were computed (see the last paragraph in 
Section 2). The columns #3-#4 give the average over all queries of the “snapshot.” 
More specifically, the first table gives the average raw CPU times, the second 
table the average number of nodes and edges hit by the search, and the third 
table the average operation counts. 



4 Conclusion and Outlook 

The outcome of this study suggests that geometric speed-up techniques are a 
good basis for the computation of provably optimal connections in railroad traffic 
information systems. The question raised in this paper is answered for the total 
travel time: the best combinations of strategies are by far faster than is currently 
needed in practice. This success is a bit surprising, because the underlying data 
is not purely geometric in nature. 

Another surprising outcome of the study is that both the normal heap and 
the dial data structure only require an (amortized) constant time per operation, 
whereas the worst-case bound is logarithmic for heaps and even linear for dials. 
Note that no amortization over a set of queries needs to be applied to obtain a 
constant time per operation; the variance is small enough that the average run time 
per operation within a single query can essentially be regarded as bounded by a 
constant. Due to the fact that the variance is negligible, a “classical” statistical 
analysis would not make any sense. 

The minimal travel time is certainly an empirical research topic in its own 
right, not only because it is the most fundamental objective in practice. However, 
a practical algorithm must consider further criteria and restrictions. For example, 
the ticket costs and the number of train changes are also important objectives. 
Moreover, certain trains do not operate every day, and certain kinds of tickets 
are not valid for all trains, so it should be possible to exclude train connections 
in a query. A satisfactory compromise must be found between the speed of the 
algorithm and the quality of the result. Thus, the problem is not purely technical 
anymore but also involves “business rules,” which are usually very informal. 

In the future, an extensive requirements analysis will be necessary, which 
means that the work will no longer be purely “algorithmical” in nature. Such an 
analysis must be very detailed because otherwise there is no hope of matching 
the real problem. Unfortunately, there is strong evidence that the general 
problems addressed in will become virulent here: a sufficiently simple formal 
model that captures all relevant details does not seem to be within our reach, 
and many details are “volatile” in the sense that they may change time and 
again in unforeseen ways. Future research will show whether these conflicting 
criteria can be fulfilled satisfactorily at the same time. 
Fig. 5. An exemplary sequence of diagrams for one particular combination of 
strategies: “angle restriction” is applied, but “selection of stations” is not. First 
column: the frequency distribution histogram of all queries in the “snapshot” 
according to (a) the number of nodes met by the search (granularity: 500 nodes), 
(b) the CPU time for the heap implementation and (c) for the dial implementation 
(granularity: 10 milliseconds). Second column: the average of (a) the number 
of nodes met and (b/c) the CPU times for the heap/dial variant taken over all 
queries with roughly the same resulting total travel times (granularity: 20 
minutes). The strong resemblance of the diagrams in each row nicely demonstrates 
the linear behavior of both priority-queue implementations. 









Fig. 7. The left picture shows the edges hit by Dijkstra’s algorithm from Berlin 
Main East Station until the destination Frankfurt/Main Main Station is reached. 
In the right picture, the strategy “angle selection” was applied to the same query. 

Fig. 8. The first picture shows the 225 stations selected for the study on strategy 
“selection of stations.” The remaining three pictures refer to the same query as 
in Fig. 7. However, now the strategy “selection of stations” is applied with no 
angle restriction (upper right), with angles computed from the train graph (lower 
left), and with angles computed from the auxiliary graph itself. The train graph 
is shown in the background. The highlighted edges are the edges of the auxiliary 
graph hit by the search. 
Abstract. In this paper we describe robust and efficient implementations 
of two graph connectivity algorithms. The implementations are 
based on the LEDA library of efficient data types and algorithms. 
Moreover, we provide experimental evaluations of the implemented 
algorithms and we compare their performance to other graph connectivity 
algorithms currently implemented in LEDA. 

The first algorithm is the Karp and Tarjan algorithm for finding the 
connected components of an undirected graph. The algorithm finds 
the connected components of a graph G = (V, E) in O(|V|) expected 
time. This is the first expected-time algorithm for the static graph 
connectivity problem implemented in LEDA. The experimental evaluation 
shows that the algorithm performs well in practice and that the 
theoretical and experimental results agree. The standard procedure 
provided by LEDA for finding the connected components of a graph, 
called COMPONENTS, has running time O(|V| + |E|). We have compared 
the performance of Karp and Tarjan’s algorithm to that of COMPONENTS 
and found that there is a wide class of graphs (the dense ones) on 
which the first algorithm dramatically outperforms the second. 

The second implemented algorithm is the Nikoletseas, Reif, Spirakis and 
Yung polylogarithmic algorithm for dynamic graph connectivity. The 
algorithm can cope with any random sequence of three kinds of 
operations: insertions, deletions and queries. The experimental evaluation 
shows that the algorithm is very efficient for particular classes of 
graphs. Comparing the performance of the implemented algorithm to 
that of other dynamic connectivity algorithms implemented in LEDA, we 
conclude that the algorithm always performs better than all of these 
algorithms for dense random graphs and random sequences of operations. 
Moreover, it works very efficiently even for sparse random graphs when 
the sequence of operations is long. 
1 Introduction 



In many areas of computer science, graph algorithms play an important role: 
problems modeled as graphs are solved by computing a property of the graph. If 
the underlying problem instance changes incrementally, algorithms are needed 
that quickly compute the property in the modified graph. A problem is called 
fully dynamic when both insertions and deletions of edges take place in the 
underlying graph. The goal of a dynamic graph algorithm is to update the 
solution after a change more efficiently than by recomputing it from 
scratch. To be precise, a dynamic graph algorithm is a data structure that 
supports the following operations: (1) insert an edge e; (2) delete an edge e; 
(3) test whether the graph fulfills a certain property, e.g., are two given 
vertices connected? The adaptivity requirements (insertions/deletions of edges) 
usually make dynamic algorithms more difficult to design and analyze than their 
static counterparts. 

Graph connectivity is one of the most basic problems, with numerous 
applications and various algorithms in different settings. The area of dynamic 
graph algorithms has been a blossoming field of research in recent years, and 
it has produced a large body of algorithmic techniques. Most of the results 
on efficient fully dynamic structures for general graphs were based on 
clustering techniques. This has led to deterministic solutions with an inherent 
time bound of O(n^ε), for some ε < 1, since the key problem encountered 
by these techniques is that the algorithm must somehow balance the work 
invested in maintaining the components of the cluster structure against the 
work on the cluster structure itself (connecting the components). 

The use of probabilistic techniques has also been considered for solving 
dynamic connectivity problems. The theory of random graphs provides a rich 
variety of techniques that can be helpful in analyzing or improving the 
performance of dynamic graph connectivity algorithms. This theory has been used 
extensively for the theoretical evaluation of several interesting dynamic 
problems. In addition, random sequences of updates on random graphs are 
commonly used as inputs for the experimental evaluation of dynamic graph 
algorithms. 

Nikoletseas, Reif, Spirakis and Yung [21] have presented a fully dynamic 
graph connectivity algorithm with polylogarithmic execution time per update for 
random graphs. Their algorithm can cope with any random sequence of three 
kinds of operations: 



Property-Query(parameter): Returns true if and only if the property holds 
(or returns a subgraph as a witness to the property). For a connectivity query 
(u, v), a true answer means that the vertices u, v are in the same connected 
component. 

Insert(x, y): Inserts a new edge joining x to y (assuming that {x, y} ∉ E). 

Delete(x, y): Deletes the edge {x, y} (assuming {x, y} ∈ E). 

In the previous operations it is assumed that random updates (insertions and 
deletions) have equal probability 1/2 and the edges to be deleted/inserted are 
chosen randomly. The algorithm admits an amortized expected cost of O(log^3 n) 
time per update operation with high probability and an amortized expected cost 
of O(1) per query. 
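The input model just described can be made concrete with a short generator. The following Python sketch is illustrative only: the function name `random_updates`, its parameters, and the fallback used when one kind of update is impossible are assumptions made here, not part of the implementation described in this paper. Each update is an insertion or a deletion with probability 1/2, and the affected edge is drawn uniformly at random from the absent or the present edges, respectively.

```python
import random


def random_updates(n, edges, steps, seed=0):
    """Yield ('insert' | 'delete', (u, v)) updates on a graph with n >= 2 vertices.

    Hypothetical helper: each update is an insertion or a deletion with
    probability 1/2, and the edge is chosen uniformly at random.
    """
    assert n >= 2
    rng = random.Random(seed)
    present = {(min(u, v), max(u, v)) for u, v in edges}
    max_edges = n * (n - 1) // 2
    for _ in range(steps):
        want_insert = rng.random() < 0.5
        # Assumption: if the chosen kind of update is impossible
        # (complete or empty graph), fall back to the other kind.
        if want_insert and len(present) == max_edges:
            want_insert = False
        elif not want_insert and not present:
            want_insert = True
        if want_insert:
            while True:  # rejection sampling of an absent edge
                u, v = rng.sample(range(n), 2)
                e = (min(u, v), max(u, v))
                if e not in present:
                    break
            present.add(e)
            yield ("insert", e)
        else:
            # Sorting keeps the draw reproducible; it is not efficient.
            e = rng.choice(sorted(present))
            present.remove(e)
            yield ("delete", e)
```

Queries can be interleaved with such a sequence at any point, since they do not change the graph.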

Another interesting work, published at about the same time, is by M. 
Henzinger and V. King. Their work presents a technique for designing dynamic 
algorithms with polylogarithmic time per operation and applies this technique 
to the dynamic connectivity, bipartiteness, and cycle equivalence problems. The 
resulting algorithms are Las Vegas type randomized algorithms. The 
connectivity algorithm achieves O(log^3 n) update time and O(log n) query time 
for both random and non-random sequences of updates. 

The area of dynamic graph connectivity algorithms is rich in theoretical 
results. By contrast, implementations and experimental studies of 
this kind of algorithm are scarce. It is only recently that some of 
these algorithms have been implemented in C++ and tested under the LEDA 
Extension Package Dynamic Graph Algorithms (LEPDGA). To the best 
of our knowledge, there are only three works in this direction. In the first 
of these works, G. Amato et al. implemented and tested Frederickson’s 
algorithms and compared them to other dynamic algorithms. In the second 
work, the dynamic minimum spanning tree algorithm based on sparsification by 
Eppstein, Galil, Italiano and Spencer and the dynamic connectivity algorithm 
presented by M. Henzinger and V. King were implemented and experimentally 
evaluated. For random inputs the algorithm by Henzinger and King was the 
fastest. For non-random inputs sparsification was the fastest algorithm for 
small sequences, while for medium and large sequences of updates, Henzinger 
and King’s algorithm was faster. The third work, by D. Frigioni et al., 
concentrates on the study of practical properties of many dynamic algorithms 
for directed acyclic graphs. More specifically, the practical behaviour of 
several dynamic algorithms for transitive closure, depth first search and 
topological sorting on such graphs has been studied. 

In this paper, we implement and experimentally evaluate two graph 
connectivity algorithms. The first one, namely GCOMPONENTS, is a linear 
expected-time algorithm for the static connectivity problem. The algorithm, 
proposed by R. Karp and R. Tarjan, finds the connected components of a 
graph G = (V, E) in O(n) expected time, where n = |V|. It is worth pointing 
out that all deterministic algorithms for the same problem work in O(n + m) 
execution time in the worst case, where m = |E|. The only algorithm for this 
problem provided by LEDA, COMPONENTS, is such a deterministic algorithm and 
thus runs in O(n + m) execution time in the worst case. The theoretical results 
therefore imply that GCOMPONENTS improves over COMPONENTS by an additive 
term of m. Our work shows that this theoretical result can be verified 
experimentally. We show through a sequence of experiments that GCOMPONENTS is 
far faster than COMPONENTS on dense random graphs. 

The second one, namely NRSY, is the Nikoletseas, Reif, Spirakis and Yung 
algorithm for dynamic connectivity. We implement the algorithm in C++ using 
LEDA and LEPDGA and evaluate its performance. More specifically, our results 
imply that the algorithm behaves efficiently on dense random graphs, as 
stated by theory. Comparing the performance of the implemented algorithm to 
the performance of other dynamic connectivity algorithms implemented in 
LEDA, we found that the algorithm performs better than those algorithms for 
random graphs with at least O(n ln n) edges, where n is the number of nodes in 
the graph. For such graphs and for random sequences of operations, our 
implementation proved to be the fastest; it is faster even than the Henzinger 
and King algorithm, which has been considered the fastest implemented algorithm 
for such inputs thus far. For sparser graphs, our implementation is also very 
efficient provided the sequence of updates is not too short. 

All our implementations are written in C++ and use LEDA/LEPDGA. The 
source code is available at the following URL: 

http://www.ceid.upatras.gr/~faturu/projects.htm 

The rest of our paper is organized as follows. Section 2 describes how Karp 
and Tarjan’s algorithm for finding the connected components of a graph has 
been implemented, and provides experimental results on its performance. 
Section 3 concentrates on the implementation and performance evaluation of the 
NRSY dynamic connectivity algorithm. We conclude with a brief discussion of 
our results in Section 4. 



2 The Karp and Tarjan’s Algorithm 

2.1 Algorithm Description 

In this section, we present the main ideas of the randomized algorithm of Karp 
and Tarjan for finding the connected components of a graph. 

Consider any graph G = (V, E) and let n = |V| and m = |E|. The algorithm 
finds the connected components of G in O(n) expected time. The algorithm 
proceeds in two stages. The first stage, called sampling, finds a giant connected 
sub-graph of G as follows: 

(1) Let E0 = ∅. 

(2) If |E \ E0| < n, let D = E \ E0. Otherwise, let D be a set of n distinct edges 
randomly selected from E \ E0. 

(3) Let E0 = E0 ∪ D. 

(4) Use DFS (Depth First Search) to find the connected components 
of G0 = (V, E0). If E = E0 or G0 has a connected component of at least θn 
vertices, where θ > 1/2 is a constant, stop. Otherwise, return to Step (2). 
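Steps (1) to (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GCOMPONENTS code of this paper: the names `components` and `sampling`, the default value 0.6 for the giant-component threshold, and the use of a shuffled copy of the edge list are all choices made here. In particular, copying and shuffling the whole edge list costs time proportional to the number of edges, so the sketch does not reproduce the linear expected running time.

```python
import random


def components(n, edges):
    """Label connected components with an iterative DFS; returns (labels, count)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    label = [-1] * n
    count = 0
    for s in range(n):
        if label[s] != -1:
            continue
        label[s] = count
        stack = [s]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if label[y] == -1:
                    label[y] = count
                    stack.append(y)
        count += 1
    return label, count


def sampling(n, edges, theta=0.6, seed=0):
    """Sampling stage: returns (E0, labels of G0, label of the giant component or None)."""
    rng = random.Random(seed)
    pool = list(edges)
    rng.shuffle(pool)  # taking a shuffled list n edges at a time draws
    taken = 0          # random distinct edges from E \ E0; step (1): E0 is empty
    while True:
        taken = min(taken + n, len(pool))   # steps (2) and (3): E0 grows by D
        e0 = pool[:taken]
        label, count = components(n, e0)    # step (4)
        sizes = [0] * count
        for c in label:
            sizes[c] += 1
        giant = next((c for c, s in enumerate(sizes) if s >= theta * n), None)
        if giant is not None or taken == len(pool):
            return e0, label, giant
```

A return value of `None` for the giant label corresponds to the case E = E0 without a giant component, in which the labels of G0 are already the connected components of G.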

The second stage of the algorithm, called cleanup, finishes the grouping of 
vertices into components and is executed if a giant component is found in the 
sampling stage. This stage uses the standard depth-first search procedure with 
a few small changes. Initially, all vertices in the giant component of G0 are 
marked giant, while the remaining vertices are unmarked. An unmarked vertex 
is selected as a start vertex and a depth first search is initiated from this vertex. 
If an edge leading to the giant component is found, the search is immediately 
aborted and all vertices reached during the search are marked giant. If the 
search finishes without reaching the giant component, all vertices reached are 
marked as being in a new component. The process of carrying out such a search 
from an unmarked vertex is repeated until no unmarked vertices remain. 
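The cleanup stage can be sketched as follows. The Python function below is an illustration written for this description, not the authors' procedure: the name `cleanup`, the adjacency-list input, and the convention that the giant component receives label 0 are assumptions. The search tentatively labels the vertices it reaches and relabels them as giant if it is aborted.

```python
def cleanup(adj, giant_vertices):
    """Return one component label per vertex; the giant component gets label 0.

    adj:            adjacency lists of the full graph G
    giant_vertices: vertices of the giant component found by sampling
    """
    GIANT, UNMARKED = 0, -1
    label = [UNMARKED] * len(adj)
    for v in giant_vertices:
        label[v] = GIANT
    next_label = 1
    for start in range(len(adj)):
        if label[start] != UNMARKED:
            continue
        label[start] = next_label        # tentative label for this search
        reached, stack, hit_giant = [start], [start], False
        while stack and not hit_giant:
            x = stack.pop()
            for y in adj[x]:
                if label[y] == GIANT:
                    hit_giant = True     # abort the search immediately
                    break
                if label[y] == UNMARKED:
                    label[y] = next_label
                    reached.append(y)
                    stack.append(y)
        if hit_giant:
            for x in reached:            # everything reached joins the giant
                label[x] = GIANT
        else:
            next_label += 1              # the search found a new component
    return label
```

Vertices left unmarked by an aborted search are picked up by a later search, which then reaches a vertex already marked giant and is aborted in turn.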

The design of the algorithm, as well as its good performance, is based upon 
the following three observations: (1) there is a well-known depth-first search 
procedure that finds the connected components of an undirected graph in 
O(m) worst-case time; (2) Erdős and Rényi have shown that there exists a 
constant θ > 1/2 such that with probability tending to one as n tends to infinity, 
a random graph with n vertices and n edges has a giant connected component 
with at least θn vertices; (3) there is no need to examine all the edges of a graph 
in order to find its connected components; it suffices to exhaust the adjacency 
lists of the vertices in all but one of the components and to look at enough edges 
in the remaining component to verify that it is connected. 

Karp and Tarjan have proved that their algorithm computes the connected 
components of any graph in O(|V|) expected time; moreover, the probability 
that the running time exceeds c|V|, where c > 0 is a constant, tends to zero 
exponentially fast as |V| tends to infinity. 

2.2 Implementation 

For the implementation of the algorithm, the LEDA library has been used. More 
specifically, the graph data structure provided by LEDA, as well as useful op- 
erations implemented on graphs (e.g., node or edge arrays, iterators, etc.) have 
been particularly helpful for the implementation of the algorithm. 

A function, called GCOMPONENTS, has been implemented. It takes as 
arguments a graph G = (V, E) and a float θ, and computes the connected 
components of the underlying undirected graph, i.e., for every node v ∈ V, an 
integer compnum[v] from [0 .. num_of_components − 1] is returned, where the 
value of the variable num_of_components expresses the number of connected 
components of G and v belongs to the i-th connected component if and only if 
compnum[v] = i. GCOMPONENTS returns a boolean, indicating whether or not a 
giant component with θn vertices exists in G. 

The implementation of GCOMPONENTS follows directly from the algorithm 
of Karp and Tarjan. The two stages of the algorithm correspond to 
calls of two procedures, called sampling and cleanup. Procedure sampling is called 
first and, if a giant component of the graph G exists, cleanup is executed. The 
main idea of sampling is that a new graph G0 = (V, E0) is incrementally created 
by adding edges of G. Initially, E0 is empty, while as many nodes as in G are 
inserted in G0. A node array is used for direct mapping of G0 nodes to G nodes. 
Procedure sampling executes a while loop. During each iteration of the loop, a set 
of at most |V| distinct edges is inserted in E0 and the steps of the first stage, 
as described in Section 2.1, are executed. The existing procedure COMPONENTS 
of LEDA is used for finding the connected components of G0. Since the running 
time of COMPONENTS is O(n + m) and G0 has O(n) edges, the running time 
of COMPONENTS on G0 is O(n). Procedure cleanup simply executes a 
depth-first search procedure starting from every UNMARKED node in order to 
assign all remaining nodes to their proper connected components. 

Fig. 1. The execution times of GCOMPONENTS and COMPONENTS for a graph 
with 1600 nodes. 

The code described above has been compiled and tested with the C++ 
compiler g++. To verify its correctness, a flexible main function has been 
created and many runs have been performed on graphs with different node 
and edge sets. The connected components calculated by GCOMPONENTS 
have been compared with those calculated by COMPONENTS. For graphs with 
small node/edge sets this comparison was straightforward. However, such 
comparisons cannot easily be performed when the number of nodes becomes large. 
In this case, we used suitable UNIX scripts for comparing the results produced 
by the two algorithms. 
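Two component labelings agree when they induce the same partition of the nodes, even if the component numbers differ. The paper relies on UNIX scripts for this comparison; the Python function below is only one possible way to perform such a check, and the name `same_partition` is an assumption made here.

```python
def same_partition(a, b):
    """True if labelings a and b group the vertices into the same components.

    The labels themselves may differ; only the grouping matters, so the
    check requires a one-to-one correspondence between the two label sets.
    """
    if len(a) != len(b):
        return False
    a_to_b, b_to_a = {}, {}
    for x, y in zip(a, b):
        # setdefault records the first pairing and returns it afterwards
        if a_to_b.setdefault(x, y) != y or b_to_a.setdefault(y, x) != x:
            return False
    return True
```

The check runs in time linear in the number of nodes, so it remains practical for large graphs.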



2.3 Performance Evaluation 

The time performance of GCOMPONENTS has been compared to that of 
COMPONENTS. COMPONENTS performs slightly better for sparse graphs, that 
is, for graphs with a small set of edges. This is natural, since if |E| ∈ 
O(|V|), the O(|V| + |E|) running time of COMPONENTS becomes O(|V|), while 
additionally the implementation of COMPONENTS is simpler than that of 
GCOMPONENTS. However, GCOMPONENTS performs dramatically better than 
COMPONENTS for dense graphs, that is, graphs with large sets of edges (e.g., 
|E| ∈ O(|V| log |V|) or |E| ∈ O(|V|^2)). 

For example, on a graph of 200 nodes, COMPONENTS performs slightly 
better than GCOMPONENTS if the number of edges is less than 5000. For denser 
graphs GCOMPONENTS becomes better than COMPONENTS. For large graphs, 
e.g., graphs with 5000 nodes and at least 500000 edges, there is a tremendous 
difference in the performance of the two algorithms (for instance, the execution 
time of COMPONENTS is up to several seconds, while GCOMPONENTS 
terminates after at most a few milliseconds). For even larger/denser graphs, the 
execution time of COMPONENTS becomes really huge (more than a few 
minutes), while GCOMPONENTS remains far faster. 

The experimental evaluation of GCOMPONENTS implies that for graphs with 
a fixed number of nodes, the algorithm has almost the same running time 
independent of the number of edges in the graph. Thus, the experiments verify the 
theoretical result that the running time of GCOMPONENTS does not depend on 
the number of edges in the graph. Another interesting observation is that the 
running time of GCOMPONENTS grows linearly with n, as stated by the 
theoretical results. 

Figure 1 presents the performance of both algorithms for a graph of 1600 
nodes.¹ All our experiments took place on an ULTRA ENTERPRISE 3000 SUN 
SPARC station with two processors working at 167 MHz and 256 Mbytes of 
memory. 

3 The NRSY Algorithm 

3.1 Description 

In this section, we describe the dynamic connectivity algorithm of Nikoletseas, 
Reif, Spirakis and Yung (NRSY), presented in [21]. 

The algorithm alternates between two epochs, namely the activation epoch 
and the retirement epoch. As long as the algorithm is in the activation epoch, 
each operation is performed in at most polylogarithmic time. Queries are 
performed in constant time. However, in order to perform these operations fast, 
the algorithm may examine many edges per delete operation, which means that 
after a small number of delete operations the algorithm has probably looked 
at the entire graph, in which case it cannot use the fact that the graph is 
random. If this happens, the algorithm switches to the retirement epoch, where 
any dynamic connectivity algorithm presented in the literature, with expected 
update time at most O(n), should be used for performing a specific number of 
operations, possibly at a higher cost than a polylogarithmic one. However, the 
number of operations performed during a retirement epoch is not too large, so 
that the amortized cost per operation of the algorithm has been proved to be 
polylogarithmic. 

Initially, the algorithm uses a slightly modified version of the linear 
expected-time algorithm of Karp and Tarjan for finding connected components to 
compute a forest of spanning trees for the original graph. We call this 
procedure of computing the forest of spanning trees a total reconstruction. We say 
that a total reconstruction is successful if it finds a giant component 
of the graph. It is worth pointing out that a total reconstruction is an expensive 
operation, since all data structures of the algorithm must be reinitialized. 

¹ More diagrams demonstrating the above observations are provided in 

A graph activation epoch is entered after a successful total reconstruction. 
The activation epoch lasts at most c1 n^2 log n operations, where c1 > 1 is a 
constant. However, the epoch may end before all these operations are performed. 
This happens when a deletion operation disconnects the spanning tree of the 
giant component and the attempted fast reconnection fails. A graph retirement 
epoch starts when the previous graph activation epoch ends. However, the 
algorithm may start with a retirement epoch if the initial total reconstruction 
is not successful. During a retirement epoch, the algorithm does not maintain 
any data structure of its own. It simply uses some other algorithm 
for dynamic connectivity (any algorithm presented in the literature with update 
time at most O(n)) to perform the operations. A graph retirement epoch lasts 
at least c2 n log^2 n operations, where c2 > 1 is a constant. At the end of a graph 
retirement epoch, a total reconstruction is attempted. If it is successful, a new 
graph activation epoch is entered. Otherwise, the graph retirement epoch 
continues for another set of c2 n log^2 n operations. The above procedure is 
repeated until a successful total reconstruction occurs. 

During an activation epoch the algorithm maintains a forest of spanning 
trees, one tree per connected component of the graph. Moreover, the algorithm 
categorizes all edges of the graph into three classes: (1) pool edges, (2) tree edges, 
and (3) retired edges. 

During an activation insertion operation, if the edge joins vertices of the same 
tree, the edge is marked as a retired edge and the appropriate data structures are 
updated; otherwise, the edge joins vertices of different trees, and the component 
name of the smaller tree is updated. During an activation deletion operation, if 
the edge is a pool or a retired edge, the edge is deleted from all data structures. 
Otherwise, if the edge is a tree edge of a small tree and the deletion of the 
edge disconnects the corresponding graph component, the tree is split into two 
small trees and the smaller of the two trees is relabeled. If, on the other hand, 
a reconnection is possible, the tree is reconnected. If the edge is an edge of the 
tree of the giant component, a specific procedure, called NeighborhoodSearch, is 
executed. 

Let the deleted edge be e = (u, v). Moreover, let T(u), T(v) be the two pieces 
into which the tree is split by the removal of e, such that u ∈ T(u) and v ∈ T(v). 
NeighborhoodSearch executes a sequence of phases. A phase starts when a new 
node is visited. NeighborhoodSearch starts two BreadthFirstSearch procedures, 
one out of u and the other out of v, and executes them in an interleaved 
way; that is, the algorithm visits nodes by executing one step of each BFS in 
turn. The visit of a node, independently of which of the two searches reaches it, 
indicates the start of a new phase. In every phase a reconnection is attempted. 
If the reconnection is achieved, the phase returns success; otherwise, it 
returns failure. The total number of nodes visited by each BFS procedure equals 
c3 log n, where c3 > 1 is a constant. After this number of nodes has been visited, 
NeighborhoodSearch ends, independently of whether it has managed to reconnect 
the giant tree or not. If any phase returns success, NeighborhoodSearch also 
returns success. If more than one phase returns success, the edge that is “closer” 
to the root is selected for the reconnection. Thus, NeighborhoodSearch may: 
(1) finish with no success, in which case the algorithm undergoes total 
reconstruction and the current activation epoch ends; NeighborhoodSearch reports 
failure in this case; (2) exhaust the nodes connected to u before the search 
finishes, so that one of the two components is a small component of O(log n) 
nodes; let this small component be G(u). We search all edges emanating from 
nodes of G(u) and if a reconnection is impossible, the giant component is simply 
split into a still-giant component and a midget component of O(log n) nodes. In 
this case the midget component is renamed, while NeighborhoodSearch reports 
success; (3) have several phases report success. In this case, the algorithm 
chooses for reconnection the edge that is “closer” to the root of the tree and 
reconnects the two pieces. 

We now describe what happens during a phase initiated by the visit of some 
node w. Assume that w is reached due to the execution of the BFS initiated at 
node u, so that w ∈ T(u). Roughly speaking, the algorithm checks if a randomly 
chosen (if any exists) pool edge out of node w reconnects the two pieces of the 
giant tree and if this reconnection is a “good” one. A good reconnection is one 
that does not increase the diameter of the tree. If a good reconnection is found, 
procedure Phase reports success; otherwise, it reports failure. 
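The interleaving of the two searches and the per-node phases can be illustrated with a simplified Python sketch. It is not the authors' NeighborhoodSearch: the function names, the parent-pointer test used to decide on which side a node lies, the natural logarithm, and the default value of c3 are assumptions made here. The sketch also omits the test for a “good” reconnection, the choice of the edge closest to the root, and the exhaustive search of a small piece; it only collects the pool edges that cross between T(u) and T(v).

```python
import math
import random
from collections import deque


def in_subtree(parent, x, root):
    """Walk parent pointers upward from x; True if root is met on the way."""
    while x is not None:
        if x == root:
            return True
        x = parent[x]
    return False


def neighborhood_search(tree_adj, parent, pool, u, v, c3=2.0, seed=0):
    """Collect pool edges (w, x) that reconnect T(u) and T(v).

    tree_adj: node -> tree neighbours, with the deleted edge (u, v) removed
    parent:   node -> parent in the old tree, with parent[u] set to None,
              so that T(u) is exactly the set of nodes whose upward walk meets u
    pool:     node -> list of other endpoints of its pool edges
    """
    rng = random.Random(seed)
    budget = max(1, math.ceil(c3 * math.log(len(parent))))  # c3 log n per BFS
    queues = [deque([u]), deque([v])]
    seen = [{u}, {v}]
    visited = [0, 0]
    found = []
    while any(q and k < budget for q, k in zip(queues, visited)):
        for side in (0, 1):                  # one step of each BFS in turn
            if not queues[side] or visited[side] >= budget:
                continue
            w = queues[side].popleft()       # visiting w starts a new phase
            visited[side] += 1
            if pool.get(w):
                x = rng.choice(pool[w])      # one random pool edge out of w
                # The edge reconnects the pieces if x lies on the other side.
                if in_subtree(parent, x, u) != (side == 0):
                    found.append((w, x))
            for y in tree_adj[w]:
                if y not in seen[side]:
                    seen[side].add(y)
                    queues[side].append(y)
    return found
```

An empty result corresponds to a search that ends without success; in the algorithm described above this would either trigger a total reconstruction or lead to the exhaustive treatment of a small piece.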

Graph activation epochs are partitioned into edge deletion intervals of c4 log n 
deletions each, where c4 > 1 is a constant. All edges marked as retired during 
an edge deletion interval are returned into the pool after a delay of one more 
subsequent edge deletion interval. Edges of small components are not reactivated 
by the algorithm. 

Nikoletseas, Reif, Spirakis and Yung have proved that the amortized 
expected time of algorithm NRSY for update operations on the graph is O(log^3 n) 
with high probability; moreover, they have proved that the query time of their 
algorithm is O(1). 

3.2 Implementation 

The algorithm has been implemented as a data structure (a class called 
nrsy_connectivity, in C++). For its implementation the Library of Efficient 
Data Types and Algorithms (LEDA), as well as the LEDA Extension Package 
Dynamic Graph Algorithms (LEPDGA), have been used. It is worth pointing out 
that our implementation has been done using the new base class of LEDA for 
dynamic graphs. The public methods of class nrsy_connectivity that are 
efficiently supported are: insert_new_edge, delete_edge and all queries 
(“global” or not). A few more public methods are provided for collecting 
statistics, as well as for demonstrating how the algorithm works (that is, a 
graphical user interface). 

In the activation epoch, each node maintains both a set of pool edges and 
a priority queue of retired edges incident to it. Each node is labeled with the 
identification of the component it belongs to. 

A forest of spanning trees is maintained during the execution of activation 
epochs of the algorithm. Each node of the tree maintains a pointer to its parent, 
its left and right children (pointers lc, rc, respectively), as well as to its left and 
right siblings (pointers ls, rs, respectively). If a node does not have children, 
pointers lc, rc are nil. If a node does not have a right or left sibling, the 
corresponding pointer points to the node’s parent. The trees are constructed in 
O(n) time by using a slightly modified version of Karp and Tarjan’s algorithm. For 
random dense graphs, these trees have expected diameter logarithmic 
in the number of nodes in the graph. 

Clearly, the way that the nodes of each tree are connected allows fast traversal 
of parts of the tree (which is required e.g., for updating labels during insert/delete 
operations). It is worth pointing out that an inserted edge that connects nodes 
of two different trees may cause an evert operation to take place in one of the 
two trees, so that some particular node becomes the new root of this tree. In 
this case, all nodes on the path from the new root up to the old root of the tree 
must be updated. However, since the diameter of the tree is of logarithmic 
length, such operations can be performed in logarithmic time. 
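The evert operation can be illustrated on a tree stored only as parent pointers. This Python sketch is a simplification made here: the implementation described above also maintains child and sibling pointers, which would have to be updated along the same path, and the name `evert` and the dictionary representation are assumptions.

```python
def evert(parent, new_root):
    """Make new_root the root by reversing the parent pointers above it.

    parent maps each node to its parent; the root maps to None. Only the
    nodes on the path from new_root to the old root are touched.
    """
    prev, node = None, new_root
    while node is not None:
        above = parent[node]   # remember the old parent before overwriting
        parent[node] = prev    # the node now hangs below its former child
        prev, node = node, above
```

For example, everting the tree `{0: None, 1: 0, 2: 1, 3: 1}` at node 2 reverses the path 2, 1, 0 and leaves node 3 attached to node 1. The work is proportional to the depth of the new root, which matches the logarithmic bound stated above when the tree has logarithmic diameter.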

Each node maintains the size of its subtree (that is, the subtree rooted 
at the node). When insertions/deletions that change the forest of spanning 
trees occur, the smaller of the two trees is updated (e.g., the size and the label 
of the nodes of the smaller tree are updated). 

Retired edges are maintained in the retired priority queues of their 
incident nodes. A counter measuring the number of operations in each epoch is 
maintained and when an edge becomes retired, its priority takes the value of 
this counter. Reactivation of retired edges occurs only before the deletion of 
tree edges, that is, edges whose deletion disconnects one of the spanning trees 
in the forest. More specifically, reactivations occur only before the execution of 
procedure NeighborhoodSearch, since it is only for this procedure that it is 
important that several pool edges exist. A retired edge is reactivated if the 
difference between the current value of the counter, which expresses the number 
of operations performed in the current epoch, and the priority of the edge is 
larger than log n. 
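The reactivation rule can be sketched with a binary heap keyed by the value of the operation counter at retirement time. The Python class below is a hypothetical simplification: it keeps a single queue instead of one priority queue per node, it uses the natural logarithm for the log n threshold, and the class and method names are assumptions made here.

```python
import heapq
import math


class RetiredEdges:
    """Single retired-edge queue with counter-based reactivation (sketch)."""

    def __init__(self, n):
        self.delay = math.log(n)  # assumption: natural logarithm
        self.counter = 0          # operations performed in the current epoch
        self.heap = []            # (priority, edge); smallest priority is oldest
        self.pool = set()

    def tick(self):
        """Record that one more operation has been performed."""
        self.counter += 1

    def retire(self, edge):
        """Retire an edge; its priority is the current counter value."""
        heapq.heappush(self.heap, (self.counter, edge))

    def reactivate(self):
        """Return to the pool every edge retired more than log n operations ago."""
        while self.heap and self.counter - self.heap[0][0] > self.delay:
            _, edge = heapq.heappop(self.heap)
            self.pool.add(edge)
```

Because the oldest edge is always at the top of the heap, reactivation stops at the first edge that is still too recent, so its cost is proportional to the number of edges actually returned to the pool, up to the logarithmic heap overhead.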

During the retirement epoch, the algorithm by Henzinger and King (that is, 
the data structure dc_henzinger_king of LEPDGA) is used for performing 
each operation. 

The correctness of the implementation has been checked by comparing the 
answers to the queries in sequences of random operations between NRSY and other 
graph connectivity algorithms already implemented in LEDA. Additionally, we 
have performed tests with random updates, and every time a specified number of 
updates had been done, we checked whether all possible queries were answered 
according to the component labels of the vertices. These labels were computed 
by the static algorithm COMPONENTS or GCOMPONENTS. 

3.3 Performance Evaluation 

In this section, we present the results of comparative experiments with NRSY
and other algorithms currently implemented in LEDA, using different types and
sizes of inputs. More specifically, we measure the performance of the following
dynamic connectivity data structures: nrsy_connectivity, dc_henzinger_king
and dc_simple. The second is the algorithm by M. Henzinger and V. King,
while the third is a simple approach to dynamic connectivity [?]. Algorithm
dc_simple maintains, and recomputes if necessary, component labels for the
vertices and edge labels for the edges in a current spanning forest. Inserting or
deleting edges which do not belong to the current spanning forest takes constant
time, while the deletion/insertion of forest edges takes time proportional to the
size of the affected component(s) (thus, O(|V| + |E|) in the worst case). Figure 2
summarizes information about the performance of the above three algorithms.
The last two algorithms have been implemented in LEDA and their performance
has been studied experimentally in [?].

Fig. 2. Theoretical bounds on the performance of each algorithm
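To make the cost model of the simple approach concrete, here is a minimal sketch in its spirit. It is not the dc_simple code: the class name, the fresh-label scheme and the choice to rebuild the whole affected component after deleting a forest edge are assumptions made for brevity.

```python
from collections import deque


class SimpleConnectivity:
    """Label-based dynamic connectivity in the spirit of dc_simple."""

    def __init__(self, n):
        self.adj = [set() for _ in range(n)]
        self.label = list(range(n))   # component label of every vertex
        self.forest = set()           # edges of the current spanning forest
        self.next_label = n           # fresh labels for rebuilt components

    @staticmethod
    def _key(u, v):
        return (u, v) if u < v else (v, u)

    def _component(self, start):
        """Vertices reachable from start (breadth-first search)."""
        seen = {start}
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for w in self.adj[x]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return seen

    def connected(self, u, v):
        return self.label[u] == self.label[v]

    def insert(self, u, v):
        if self.label[u] != self.label[v]:
            # New forest edge: relabel the component of v (cost ~ its size).
            new_label = self.label[u]
            for x in self._component(v):
                self.label[x] = new_label
            self.forest.add(self._key(u, v))
        self.adj[u].add(v)
        self.adj[v].add(u)

    def _rebuild(self, start):
        """Give the component of start a fresh label and a fresh BFS tree."""
        fresh = self.next_label
        self.next_label += 1
        for x in self._component(start):
            self.label[x] = fresh
            for w in self.adj[x]:
                self.forest.discard(self._key(x, w))
        seen = {start}
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for w in self.adj[x]:
                if w not in seen:
                    seen.add(w)
                    self.forest.add(self._key(x, w))
                    queue.append(w)

    def delete(self, u, v):
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        key = self._key(u, v)
        if key not in self.forest:
            return                    # non-forest edge: nothing to recompute
        self.forest.discard(key)
        self._rebuild(u)
        if self.label[v] != self.label[u]:
            self._rebuild(v)          # the deletion split the component
```

Updates on non-forest edges touch only the adjacency sets, while forest-edge updates traverse the affected component, which is the behavior the O(|V| + |E|) worst-case bound describes.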

Since the NRSY algorithm guarantees good performance only on random inputs,
we concentrate our study only on random sequences of updates (and queries) 
on random graphs. We are mainly interested in the average case performance of 
the algorithm on such inputs. The random inputs consisted of random graphs 
with different densities, and sequences of random operations. Each sequence of 
random operations consisted of random insertions, deletions and queries on the 
corresponding graph. 

We conducted a series of tests for graphs on n = 50, 100, 300, 500, 750, and
1000 vertices and different sizes of edge sets. More specifically, for each such n
we tested the performance of the above algorithms on graphs with m = n/2,
n, n ln n, and n^2/4 edges. For every pair of values for n and m we did five
experiments and took averages. In every experiment, the same input was used
for all algorithms. Since the NRSY algorithm may execute several total
reconstructions during its execution, we did not measure separately the time that the
algorithm spent in preprocessing, that is, for building the data structure for the
initial graph, and processing, that is, for performing the sequence of operations.
For the other two algorithms (dc_henzinger_king, dc_simple) we measure the
time spent only for performing the operations (and not for the initialization of
the data structures). We measure the performance of the algorithms for long
update sequences, e.g., of 100000 operations, for medium sequences, e.g., of 10000
operations, and for short sequences, e.g., of 1000 operations. We did all
experiments on an ULTRA ENTERPRISE 3000 SUN SPARC station with two processors
working at 167MHz with 256 Mbytes of memory.
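A test input of this kind could be generated along the following lines. The sketch is only indicative: the function names are hypothetical, the densest setting is read as n^2/4 edges, and the uniform mix of insertions, deletions and queries is an assumption, since the text does not state the proportions.

```python
import math
import random


def edge_counts(n):
    """The four densities used per n: n/2, n, n ln n and n^2/4 edges."""
    return [n // 2, n, round(n * math.log(n)), n * n // 4]


def random_graph(n, m, rng):
    """m distinct edges drawn uniformly from all vertex pairs (u < v)."""
    pairs = [(u, v) for u in range(n) for v in range(u + 1, n)]
    return rng.sample(pairs, m)


def random_operations(n, edges, length, rng):
    """Random insertions, deletions and queries, kept consistent with the graph."""
    present = set(edges)
    ops = []
    while len(ops) < length:
        kind = rng.choice(("insert", "delete", "query"))
        if kind == "delete":
            if not present:
                continue
            u, v = rng.choice(sorted(present))
            present.remove((u, v))
        else:
            u, v = sorted(rng.sample(range(n), 2))
            if kind == "insert":
                if (u, v) in present:
                    continue
                present.add((u, v))
        ops.append((kind, u, v))
    return ops
```

Passing the same seeded `random.Random` instance to every run reproduces the same input for all algorithms, as the experiments require.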

The most difficult inputs for the NRSY algorithm are inputs with m equal to or
slightly larger than n. In this case, the theory of random graphs implies that the
graph is with high probability disconnected, but there may exist some component
of size in the order of [...]. Thus, with high probability a total reconstruction
does not succeed in finding a giant component, or if it finds one it is highly
likely that it will become disconnected in the next few operations. Thus, either
the algorithm does not run in an activation epoch, or it enters an activation
epoch and after a few operations it has to run a total reconstruction again.
Recall that a total reconstruction is an expensive operation. Even if the
algorithm enters and remains in the activation epoch for some operations, almost
all deletions/insertions are expected to be tree updates. Thus, in this case most
of the operations to be performed are expensive.



Difficult inputs also appear to be those where m < n. In this case, the theory
of random graphs implies that with high probability the graph consists of several
components of size O(log n). Thus, with very high probability a total
reconstruction will not succeed. At this point the algorithm uses the dc_henzinger_king
data structure to execute the retirement epoch. If a total reconstruction
succeeds after the retirement epoch, the algorithm enters an activation epoch, and
so on. We expect that nrsy_connectivity and dc_henzinger_king perform in
a similar way in this case. However, nrsy_connectivity periodically performs a
total reconstruction and checks if a giant component exists in the graph. Since a
total reconstruction is expensive, it would also be natural for nrsy_connectivity
to be slightly slower than dc_henzinger_king in this case.

For inputs with m > n, that is, if m ∈ O(n ln n) or m ∈ O(n^2), the theory
of random graphs implies that a giant component exists in the graph, so that
nrsy_connectivity should spend most of its time in activation epochs. Moreover,
most of the updates are non-tree updates, so that nrsy_connectivity
should perform them very fast (in O(1) time). We further expect that the
behavior of the algorithm should be better for long/medium sequences of operations
than for short sequences, since the theoretical results hold asymptotically.



The experiments that we performed confirm all the above observations. More
specifically, for long sequences of operations, our implementation proves to
be very fast, faster than Henzinger and King’s algorithm in most cases. For
medium sequences of operations our implementation has similar performance
to that of dc_henzinger_king for values of m close to n, while it is better
for denser graphs. For short sequences of updates, the performance of our
implementation becomes worse than that of the other two algorithms for sparse graphs.
Recall that this is expected, since the algorithm alternates between retirement
epochs and total reconstructions, while all operations performed are with high
probability expensive operations. Notice that this bad behavior of the algorithm
in this case does not contradict the theoretical results, since all bounds provided
in [20] have been derived assuming dense random input graphs. Moreover, even
in the case of short sequences, our implementation has better performance than
the other two algorithms for dense graphs (that is, if m is at least on the
order of n ln n).



For the behavior of the other two algorithms we derive similar observations as
those presented in [?]. Roughly speaking, dc_henzinger_king benefits from small
components as they appear for m < n; moreover, the most difficult inputs are
again those for which m is close to n, while the algorithm gets better and better
as the number of edges increases (still remaining worse than nrsy_connectivity
though). For dc_simple, if m increases the expected number of tree updates
decreases, but each such tree update becomes very expensive. In theory, these
two effects cancel out, so that it appears that the expected running time of
dc_simple is O(n) per update.

Fig. 3. Performance of NRSY, Henzinger-King and Simple for a random graph
of 500 nodes and a sequence of 10000 random operations.

The performance of dc_henzinger_king has been compared in [?] to that
of a number of other algorithms, among others sparsification [?], and it has been
shown that dc_henzinger_king is the fastest algorithm among all for random
inputs. Since NRSY performs better than Henzinger and King’s algorithm for such
inputs (especially if the graph is dense) we conclude that nrsy_connectivity is
the fastest algorithm for random sequences of updates on dense random graphs
implemented under LEDA thus far. Moreover, the performance of NRSY is very
good (comparable to or better than that of Henzinger and King’s algorithm)
even for non-dense random graphs, if the sequence of updates is long.

We did not test the performance of nrsy_connectivity for non-random inputs,
since we believe that the results would not be of theoretical interest. Recall
that theory does not guarantee any bound on the update time of NRSY for
non-random inputs on non-random graphs. Some of the above observations are
illustrated in Figures 3 and 4.



2 More diagrams demonstrating the above observations are provided in

Fig. 4. Performance of NRSY, Henzinger-King and Simple for a random graph
of 100 nodes and a sequence of 1000 random operations.

4 Discussion 

We have described elegant, robust and efficient implementations of two graph 
connectivity algorithms, based on LEDA C++ and LEPDGA. Moreover, we have 
shown with experimental data that these algorithms are fast and efficient in 
practice. 

We are currently working on testing the performance of modified versions of 
the NRSY algorithm. One such direction is, for example, to let the algorithm 
remain in the activation epoch till the giant component is disconnected, instead 
of alternating between the two epochs after the execution of a specific number 
of operations. We believe that such modifications may yield better experimental 
results.

Abstract. Given a finite set L of lines in the plane we wish to compute
the zone of an additional curve γ in the arrangement A(L), namely the
set of faces of the planar subdivision induced by the lines in L that are
crossed by γ, where γ is not given in advance but rather provided on-line,
portion by portion. This problem is motivated by the computation
of the area bisectors of a polygonal set in the plane. We present four
algorithms which solve this problem efficiently and exactly (giving precise
results even on degenerate input). We implemented the four algorithms.
We present implementation details, comparison of performance, and a
discussion of the advantages and shortcomings of each of the proposed
algorithms.



1 Introduction 



Given a finite collection L of lines in the plane, the arrangement A(L) is the
subdivision of the plane into vertices, edges and faces induced by L. Arrangements
of lines in the plane, as well as arrangements of other objects and in
higher dimensional spaces, have been extensively studied in computational
geometry [?], and they occur as the underlying structure of the algorithmic
solution to geometric problems in a large variety of application domains. The
zone of a curve γ in an arrangement A(L) is the collection of (open) faces of the
arrangement crossed by γ (see Figure 1 for an illustration).

In this paper we study the following algorithmic problem: Given a set L of n
lines in the plane, efficiently construct the zone of a curve γ in the arrangement
A(L), where γ consists of a single connected component and is given on-line,
namely γ is not given in its entirety as part of the initial input but rather given
(contiguously) piece after piece, and at any moment the algorithm has to report
all the faces of the arrangement crossed by the part of γ given so far.

Fig. 1. An example of a simple arrangement induced by 7 lines and a bounding
box, and the zone of a polygonal curve in it. The polyline v_0, ... crosses 4 faces
(shaded faces)



There is a straightforward solution to our problem. We can start by constructing
A(L) in O(n^2) time, represent it in a graph-like structure Q (say, the
half-edge data structure [?, Chapter 2]), and then explore the zone of γ by
walking through Q. However, in general we are not interested in the entire
arrangement. We are only interested in the zone of γ, and this zone can be anything
from a single triangular face to the entire arrangement.

The combinatorial complexity (complexity, for short) of a face in an arrangement
is the overall number of vertices and edges on its boundary. The complexity
of a collection of faces is the sum of the face complexity over all faces in the
collection. The complexity of an arrangement of n lines is Θ(n^2) (and if we allow
degeneracies, possibly less). The complexity of the zone of an arbitrary curve in
the arrangement can range from Θ(1) to Θ(n^2). The simple algorithm sketched
above will always require Ω(n^2) time, which may be too much when the
complexity of the zone is significantly below quadratic. (Note that even an optimal
solution to our problem may require more than O(n^2) running time for certain
input curves, since we do not impose any restriction on the curve and it may
cross a single face an arbitrarily large number of times.)

If the whole curve γ were given as part of the initial input (together with
the lines in L) then we could have used one of several worst-case near-optimal
algorithms to solve the problem. The idea is to transform the problem into that
of computing a single face in an arrangement of segments, where the segments
are pieces of our original lines that are cut out by γ [?]. Denote by k the overall
number of intersections between γ and the lines in L. The complexity of the zone
of γ in A(L) (or the complexity of the single face containing γ in the modified
arrangement) is bounded by O((n + k)α(n)), and several algorithms exist that
compute a single face in time that is only a polylogarithmic factor above the
worst-case combinatorial bound, for example [?].

However, since in our problem we do not know the curve γ in advance, we
cannot use these algorithms, and we need alternative, on-line solutions. Our
problem, the on-line version of the zone construction, is motivated by the study
of area bisectors of polygonal sets in the plane, which is in turn motivated
by algorithms for part orienting using Micro Electro Mechanical Systems.
The curve γ that arises in connection with the area bisectors of polygons is
determined by the faces of the arrangement through which it passes; it changes
from face to face, and its shape within a face f is dependent on f.

^ α(·) denotes the extremely slowly growing inverse of the Ackermann function.

In the work reported here we assume that γ is a polyline given as a set
of m + 1 points in the plane v_0, v_1, ..., v_m, namely the collection of m segments
v_0v_1, v_1v_2, ..., v_{m-1}v_m, given to the algorithm in this order. In the motivating
problem γ is a piecewise algebraic curve where each piece can have an arbitrarily
high degree. We simplify the problem by assuming that the degree of each piece
is one since our focus here is on the on-line exploration of the faces of the
arrangement. The additional problems that arise when each piece of the curve
can be of high degree are (almost completely) independent of the problems that
we consider in this paper and we discuss them elsewhere [?]. We further assume
that we are given a bounding box B, i.e., the set L contains four lines that define
a rectangle which contains v_0, ..., v_m (see Figure 1).

An efficient solution to the on-line zone problem is given in [?], and it is based
on the well-known algorithm of Overmars and van Leeuwen for the dynamic
maintenance of halfplane intersection [?]. The data structure described in [?]
is rather intricate and we anticipated that it would be difficult to implement.
We resorted instead to simpler solutions which are nevertheless non-trivial.

We devised and implemented four algorithms for on-line zone construction.
The first algorithm is based on halfplane intersection and maintains a balanced
binary tree on a set of halfplanes induced by L (it is reminiscent of the
Overmars-van Leeuwen structure, but it is simpler and does not have the good
theoretical guarantee on running time of the latter). The second algorithm works
in the dual plane and maintains the convex hull of the set of points dual to the
lines in L. Its efficiency stems from a heuristic to recompute the convex hull
when the hull changes (the set of points does not change but their contribution
to the lower or upper hull changes according to the face of the arrangement that
γ visits). Algorithms 1 and 2 could be viewed as simple variants of the algorithm
described in [?]. The third algorithm presents a novel approach to the problem.
It combines randomized construction of the binary plane partition induced by the
lines in L together with maintaining a doubly connected edge list for the faces
of the zone that have already been built. We give two variants of this algorithm,
where the second variant (called Algorithm 3b) handles conflict lists more
carefully than the first (Algorithm 3). Finally, the fourth algorithm is also a
variant of Algorithm 3; however, it differs from it in a slight but crucial
manner: it refines the faces of the zone as they are constructed such that in the
refinement no face has more than some small constant number of edges on its
boundary.

The four algorithms are presented in the next section, together with
implementation details and description of optimizations. In Section 3 we describe a
test suite of five input sets on which the algorithms have been examined, followed
by a chart of experiment results. Then in Section 4 we summarize the lessons
we learned from the implementation, optimization and experiments. Concluding
remarks and suggestions for future work are given in Section 5.

2 Algorithms and Implementation

2.1 Algorithm 1: Halfplane Intersection 

Given a point p and a set L of n lines, the face that contains p in the
arrangement A(L) can be computed in O(n log n) time by intersecting the set of
halfplanes induced by the lines and containing p. The idea is to divide the set
of halfplanes into two subsets, recursively intersect each subset, and then use
a linear-time algorithm for the intersection of the two convex polygons (see,
for example, [?, Chapter 4]). We shall extend this scheme to the on-line zone
construction by maintaining the recursion tree.

The General Scheme Let us assume we have a procedure ConvexIntersect
that intersects two convex polygons in linear time. Given the first point v_0 of γ,
we construct the recursion tree in a bottom-up manner in O(n log n) time, as
follows. The lowest level of the tree contains n leaves, which hold the intersection
of the bounding box with each halfplane induced by a line of L and containing v_0.
The next levels are constructed recursively by applying ConvexIntersect in
postorder. The resulting data structure is a balanced binary tree, in which each
internal node holds the intersection of the convex polygons of its two children.
The first face the algorithm reports is the one held in the root of the tree.

When the polyline moves from one face of the arrangement to its neighbor it
crosses a line in L. All we need to do is update the leaf in the recursion tree that
corresponds to this line (namely, take the other halfplane it determines), and
then reconstruct the polygons of its ancestors, again in a bottom-up manner.
Clearly, for each update the number of calls to ConvexIntersect is O(log n).
The time it takes to move from one face to its neighbor therefore depends on
the total complexity of the polygons in the path up the tree. In the worst case
this can add up to O(n) even if the returned face (i.e., the root of the tree) is
of low complexity. However, if the complexity of each polygon along the path is
constant, then the update time is only O(log n).
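The scheme can be sketched as follows, under several assumptions that are not part of the original: lines are triples (a, b, c) meaning ax + by + c = 0, the bounding box is a counter-clockwise vertex list, the input is in general position, and the class name RecursionTree is hypothetical. For brevity the sketch intersects two convex polygons by repeated clipping, which takes quadratic rather than linear time, so it stands in for ConvexIntersect only functionally.

```python
from fractions import Fraction


def side(line, p):
    """Value of a*x + b*y + c at point p for the line a*x + b*y + c = 0."""
    a, b, c = line
    return a * p[0] + b * p[1] + c


def clip(poly, line, sign):
    """Part of the convex polygon poly where sign * (a*x + b*y + c) >= 0."""
    out = []
    for i, p in enumerate(poly):
        q = poly[(i + 1) % len(poly)]
        sp, sq = sign * side(line, p), sign * side(line, q)
        if sp >= 0:
            out.append(p)
        if sp * sq < 0:  # edge pq properly crosses the line
            t = Fraction(sp) / (sp - sq)
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out


def convex_intersect(poly_a, poly_b):
    """Clip poly_a against every edge of the counter-clockwise polygon poly_b."""
    if not poly_b:
        return []
    for i, (x1, y1) in enumerate(poly_b):
        x2, y2 = poly_b[(i + 1) % len(poly_b)]
        a, b = y1 - y2, x2 - x1  # the left of the edge is the positive side
        poly_a = clip(poly_a, (a, b, -(a * x1 + b * y1)), 1)
    return poly_a


class RecursionTree:
    """Balanced tree of halfplane intersections, stored as an implicit heap."""

    def __init__(self, lines, box, v0):
        self.lines, self.box = lines, box
        self.sign = [1 if side(l, v0) > 0 else -1 for l in lines]
        self.size = 1
        while self.size < len(lines):
            self.size *= 2
        self.node = [box] * (2 * self.size)  # unused leaves hold the box
        for i, line in enumerate(lines):
            self.node[self.size + i] = clip(box, line, self.sign[i])
        for k in range(self.size - 1, 0, -1):
            self.node[k] = convex_intersect(self.node[2 * k], self.node[2 * k + 1])

    def face(self):
        """The face currently containing the curve is held at the root."""
        return self.node[1]

    def cross(self, i):
        """The curve crosses line i: flip its halfplane, rebuild the ancestors."""
        self.sign[i] = -self.sign[i]
        k = self.size + i
        self.node[k] = clip(self.box, self.lines[i], self.sign[i])
        k //= 2
        while k >= 1:
            self.node[k] = convex_intersect(self.node[2 * k], self.node[2 * k + 1])
            k //= 2
```

Each call to `cross` recomputes one polygon per level of the tree, which is the O(log n) bound on the number of intersection calls; the actual cost depends on the sizes of the polygons along that path.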

Convex-Polygon Intersection The main “building block” of the algorithm
is the ConvexIntersect procedure. We first implemented a variant of the
algorithm described in [?, Chapter 4]. The idea is to decompose each convex polygon
into two chains, left and right, and use a sweep algorithm to sweep down these
four chains, handling the various possible cases in each intersection event (this
algorithm corresponds, with minor modifications, to the Shamos-Hoey method
described in [?]). This algorithm turned out to be quite expensive: in each
event, several cases of edge intersections have to be checked, although some of
them cannot appear more than twice.

Therefore, we implemented a different algorithm that sweeps the two left 
chains and the two right chains separately, and then sweeps through the resulting 
left and right chains to find the top and bottom vertices of the intersection 
polygon. These sweeps are very simple, and can be implemented efficiently. For 
two polygons with 40 vertices, whose intersection contains 80 vertices, the new 
algorithm runs 4 times faster than the original one.

Optimization In addition to the improved intersection algorithm described
above, we applied several other optimization techniques. The most important
of these was the use of floating-point filters (see Section [?]), which reduced
the running time by a factor of 2. Another method we used to accelerate the
intersection function was the use of a hashing scheme to avoid re-computation
of intersections already computed (see Section [?]). However, this did not yield
a significant speedup: for an input of 1000 random lines and a hash table
with 60000 entries, we got an improvement of only 10-15 percent.

2.2 Algorithm 2: Dual Approach 

The algorithm is based on the duality between the problems of computing the 
intersection of halfplanes and calculating the convex hull of a set of points. The 
duality transform maps lines (points) in the primal plane to points (lines) in 
the dual plane, preserving above/below relations — for example, a primal point 
above a primal line is mapped to a dual line which is above the line’s dual point. 
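One duality convention with the stated property can be written down explicitly. The convention below is an assumption, since the text does not fix one: a non-vertical line y = mx + c is stored as the pair (m, c) and maps to the point (m, c), while the point (a, b) maps to the line y = -ax + b. The function names are hypothetical.

```python
def dual_of_line(line):
    """The line y = m*x + c, given as (m, c), maps to the point (m, c)."""
    m, c = line
    return (m, c)


def dual_of_point(point):
    """The point (a, b) maps to the line y = -a*x + b, given as (-a, b)."""
    a, b = point
    return (-a, b)


def height_above(point, line):
    """Signed vertical distance of the point above the line y = m*x + c."""
    x, y = point
    m, c = line
    return y - (m * x + c)
```

With this convention, `height_above(p, l)` equals minus `height_above(dual_of_line(l), dual_of_point(p))`: if p lies above l, then the dual point of l lies below the dual line of p, that is, the dual line is above the dual point.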

Given a set L of lines in the plane and a point p, we can find the intersection of
halfplanes induced by the lines and containing p by first transforming L to a set
of points P in the dual plane, and transforming p to a dual line ℓ. Next, we need
to partition the set P into two subsets according to whether a point lies above
or below ℓ. We then compute the convex hull of each of the two subsets. Now, by
spending an additional linear amount of work (specifically, finding the tangents
connecting the two convex hulls) we obtain a list of points whose original primal
lines make up the boundary of the respective intersection of halfplanes, in
the desired order. For more details on the duality between convex hulls and
halfplane intersection see, e.g., [?, Section 11.4].

Our algorithm is based on a modification by Andrew [?] of Graham’s scan
algorithm [?]. The first step of this algorithm calls for sorting the set of points,
and then the actual convex hull is computed in an O(n) scan. Since our set of
points is fixed, we can pre-sort them and, as the traversal progresses, shift points
from one subset to the other without destroying the implied order: all we have
to do is keep a tag with each point indicating the subset it currently belongs to.
When moving from a face to one of its neighbors through an edge, a single dual
point (whose primal line contains the crossed edge) must be moved from one
subset to the other, and the convex hulls should be recomputed. Thus, after an
initial O(n log n) work for sorting the points, the algorithm reports each face of
the zone in O(n) time. A few simple observations enable us to effectively reduce
the needed amount of work in many cases. Due to lack of space, we will not give
a detailed description of the algorithm in this paper.
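The basic pre-sort-and-tag idea, without the additional observations that the text omits, might look as follows. The sketch assumes distinct dual points; the class name TaggedHulls and its methods are hypothetical, and the tangent-finding step is not shown.

```python
def cross(o, a, b):
    """Twice the signed area of the triangle o, a, b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_of_sorted(points):
    """Convex hull (counter-clockwise) of points already sorted by (x, y)."""
    if len(points) <= 2:
        return list(points)
    lower = []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


class TaggedHulls:
    """Points are sorted once; a tag records the subset each one belongs to."""

    def __init__(self, points, above):
        order = sorted(range(len(points)), key=lambda i: points[i])
        self.points = [points[i] for i in order]
        self.above = [above[i] for i in order]
        self.index = {p: k for k, p in enumerate(self.points)}

    def flip(self, p):
        """Move the dual point p to the other subset (its line was crossed)."""
        k = self.index[p]
        self.above[k] = not self.above[k]

    def hulls(self):
        """Hulls of the two subsets, each found by a linear scan, no re-sorting."""
        up = [p for p, t in zip(self.points, self.above) if t]
        down = [p for p, t in zip(self.points, self.above) if not t]
        return hull_of_sorted(up), hull_of_sorted(down)
```

Because filtering a sorted list by tag preserves the order, each face change costs one tag flip plus two linear scans, matching the O(n) per-face bound.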

2.3 Algorithm 3: Binary Plane Partition 

As explained in Section 1, a straightforward solution to the zone problem would
be to construct the arrangement A(L), and then explore the zone of γ, face by
face. However, constructing A(L) takes O(n^2) time (and space), whereas the
actual zone may be of considerably lower complexity. In this section we shall
describe an algorithm that maintains a partial arrangement, namely a subset
of the faces induced by L. The idea is to construct only the parts of
A(L) that are intersected by γ, adding faces to the partial arrangement on-line,
as we advance along the curve. The algorithm we have developed is a
variant of the randomized binary plane partition (BPP) technique. The partial
arrangement A*(L) is represented in a data structure that is a combination of
a doubly connected edge list (DCEL) and the BPP’s binary search tree, both of
which are described in [?]. We call this data structure a face tree.



The Face Tree Data Structure The face tree is a binary search tree whose
nodes correspond to faces in the plane. Each face is split into two sub-faces,
using one of the input lines that intersect it. We assign an arbitrary orientation
to each line in L. The sub-face that is to the left of the splitting line is set as
the node’s left child, and the right sub-face as its right child.

Formally, denote by f(u) the face that corresponds to node u. The set of
inner segments of u is the intersection of the lines in L with the face f(u):
I(u) = { ℓ ∩ f(u) | ℓ ∈ L, ℓ ∩ f(u) ≠ ∅ }; the set I(u) is often referred to as
the conflict list of f(u). The face f(u) is partitioned into two sub-faces using
one of these inner segments, which we shall call the splitter of f(u), and is
denoted s(u). We say that a face is final if I(u) = ∅, in which case it always
corresponds to a leaf in the face tree. Locating a point p in a tree whose depth
is d takes O(d) time, using a series of calls to the SideOfLine predicate, i.e.,
simple queries of the form “is p to the left of s(u)?”.

As we have already mentioned, the faces are not only part of a binary search
tree; they also form a DCEL. Each face is described by a doubly connected cyclic
linked list of its edges (or half-edges, as they are called in [?]). For each half-edge
in the DCEL, we maintain its twin half-edge and a pointer to its incident face,
as in a standard DCEL. These pointers enable us to follow γ efficiently along
adjacent faces, which is useful when γ revisits faces it has already traversed
before.

Let u be a leaf corresponding to a non-final face in A*(L), with n_u inner
segments. In order to split the face f(u), we first choose a splitter s(u) from
I(u) at random; we then prepare the lists that describe the edges of the two
sub-faces in constant time; and, finally, we prepare the inner segments list of
each sub-face. Thus, splitting a face takes a total of O(n_u) time.
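A minimal sketch of the face tree with on-demand splitting is given below. It makes several simplifying assumptions: faces are stored as plain convex vertex lists rather than a DCEL (so the sub-face edge lists are not prepared in constant time), lines are triples (a, b, c) with the positive side playing the role of "left", the located point lies on no input line, and the names FaceNode and locate are hypothetical.

```python
import random
from fractions import Fraction


def side(line, p):
    """Value of a*x + b*y + c at point p for the line a*x + b*y + c = 0."""
    a, b, c = line
    return a * p[0] + b * p[1] + c


def split_polygon(poly, line):
    """Split a convex polygon by a line into (positive part, negative part)."""
    pos, neg = [], []
    for i, p in enumerate(poly):
        q = poly[(i + 1) % len(poly)]
        sp, sq = side(line, p), side(line, q)
        if sp >= 0:
            pos.append(p)
        if sp <= 0:
            neg.append(p)
        if sp * sq < 0:  # edge pq properly crosses the line
            t = Fraction(sp) / (sp - sq)
            x = (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
            pos.append(x)
            neg.append(x)
    return pos, neg


def crosses(line, poly):
    """True if the line passes through the interior of the convex polygon."""
    values = [side(line, p) for p in poly]
    return min(values) < 0 < max(values)


class FaceNode:
    def __init__(self, poly, conflicts):
        self.poly = poly            # the face f(u)
        self.conflicts = conflicts  # I(u): lines crossing the interior of f(u)
        self.splitter = None        # s(u), set once the node has been split
        self.left = None            # sub-face on the positive side of s(u)
        self.right = None           # sub-face on the negative side of s(u)


def locate(node, p, rng):
    """Descend to the final face containing p, splitting leaves on demand."""
    while True:
        if node.splitter is None:
            if not node.conflicts:
                return node                      # final face
            s = rng.choice(node.conflicts)       # random splitter
            pos, neg = split_polygon(node.poly, s)
            rest = [l for l in node.conflicts if l != s]
            node.left = FaceNode(pos, [l for l in rest if crosses(l, pos)])
            node.right = FaceNode(neg, [l for l in rest if crosses(l, neg)])
            node.splitter = s
        node = node.left if side(node.splitter, p) > 0 else node.right
```

Preparing the two conflict lists scans all remaining inner segments of the face, which corresponds to the O(n_u) splitting cost; faces that the point never enters are left unsplit.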



The General Scheme The algorithm starts by initializing the root r of the face
tree as the bounding box B, and its set of inner segments I(r) as { ℓ ∩ B | ℓ ∈ L }.
Given the first point v_0 of the polyline, we locate it in the face tree. Obviously, we
will find it in f(r). Since r is not a final leaf, we split it, and continue recursively
in the sub-face that contains v_0, as illustrated in Figure 2. The search ends when
we reach a final leaf, that is, a face f_0 that is not intersected by any line in L.

Fig. 2. An illustration of the construction of the face tree given the first point v_0 of
the polyline: first, the root face f(r) is set as the bounding box; then, f(r) is split
into two sub-faces, f(u), which contains v_0, and f(v) (1); similarly, f(u) is split
into two sub-faces, f(w) and f(x) (2); and so on until no input line crosses the
face f_0 which contains v_0

Suppose we are now given the next vertex v_1 along γ. If v_0v_1 exits f_0 at point
v_e on edge e_e (see Figure 3), then we skip to the opposite face using e_e’s
twin-edge pointer, e_t, and locate v_e in it (actually, we locate a point infinitesimally
close to v_e and into the face). As long as we are in an internal node of the face
tree, we simply check on which side of the corresponding splitter v_e lies, and
continue the search there; once we reach a leaf f', we repeat the same splitting
algorithm as described above, until we find the face f_1 that contains v_e (we now
update the twin-edge pointers of both e_e and e_t ∩ f_1, so that next time γ moves
from f_0 to f_1, or vice versa, we can follow it in O(1) time).



Improved Maintenance of the Conflict Lists The main shortcoming of the
algorithm we have just described is its worst-case behavior of [...] time for
preparing a face with O(n) edges, for example input 5 in Section 3. To overcome
this shortcoming, we developed a data structure that enables us to split a face
without having to check all its inner segments. Instead of maintaining the inner
segments in a simple array, we distribute them among the vertices of the face
in conflict lists. Denote by p the point that is being located (p is either the first
vertex of the polyline, or its exit point from the previous face). An inner segment
s in face f is said to be in conflict with the vertex v if v and p lie on different
sides of s. According to our new approach, we maintain a list of inner segments
in each vertex, so that all the segments that are held in the list of a vertex v
are in conflict with v. Each inner segment is kept in only one list, at one of the
vertices that it is in conflict with.
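The conflict relation and the initial distribution of segments among vertices can be sketched as follows. Lines are again assumed to be triples (a, b, c), the face is given by its vertex list, p is assumed to lie inside the face and on no line, and the function names are hypothetical.

```python
import random


def side(line, p):
    """Value of a*x + b*y + c at point p for the line a*x + b*y + c = 0."""
    a, b, c = line
    return a * p[0] + b * p[1] + c


def in_conflict(line, v, p):
    """True if vertex v and the located point p lie on different sides of the line."""
    return side(line, v) * side(line, p) < 0


def build_conflict_lists(vertices, lines, p, rng):
    """Store each line at one randomly chosen vertex that it conflicts with."""
    lists = {v: [] for v in vertices}
    for line in lines:
        candidates = [v for v in vertices if in_conflict(line, v, p)]
        if candidates:  # a line with no conflicting vertex misses the face
            lists[rng.choice(candidates)].append(line)
    return lists
```

A line that crosses a convex face containing p must separate p from at least one vertex, so every inner segment ends up in exactly one list.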

In the first stage of the algorithm, we wish to locate the first vertex v_0 of the
polyline. To this end, we first prepare the conflict lists of the bounding box by
simply checking for each input line with which vertices it is in conflict, and adding
the line to one of the corresponding conflict lists at random. We then split the
bounding box using a randomly chosen splitter, and continue recursively until
we reach a final face. It remains to show how the conflict lists can be updated
when the face is split. Let f be a face that contains v_0, and denote by s its
splitter. s divides f into two sub-faces: f_g, or the green face, that contains v_0,
and f_r, the red face. Let u_1 and u_2 be the endpoints of s. In order to construct
the final face that contains v_0, we need to work (i.e., recursively split) only on
the green face. We therefore update its conflict lists: for each segment in one
of f_r’s conflict lists, we add it to the conflict list of u_1 (if it is in conflict only
with u_1), u_2 (similarly), one of u_1 and u_2 at random (if it is in conflict with
both vertices), or none (in case it is in conflict with neither u_1 nor u_2, which
means that it does not intersect f_g). This algorithm guarantees that the face
that contains v_0 is constructed in expected time O(n log n) (see [?]).

Fig. 3. Following γ along A*(L) after the first face (f_0) has been built: first, we
find the exit point v_e (on edge e_e) and skip to the adjacent face f' using the twin
edge e_t; then, from f' we build the face f_1 through which γ passes using the same
recursive procedure we used for building f_0 from f(r) (as shown in Figure 2)

The rest of the algorithm is very similar to the original scheme: given the
next vertex along the polyline, we first locate the exit point from the current
face, and construct the next face γ intersects. However, the conflict lists in the
red faces need to be updated, since the definition of a conflict depends on the
point being located, and it has changed. Furthermore, some of the inner segments
might be missing, and we need to collect them from the relevant green face, i.e.,
the sibling of the red face. The solution we implemented is to gather all the
inner segments from the current face and from the sub-tree of the sibling node,
and prepare new conflict lists, as we have done for the bounding box. Once the
conflict lists are built, we continue as explained above.

This new algorithm, which we will denote 3b, gave a substantial performance
improvement, especially for the cases of complex faces (see Section 3). Both
Algorithm 3 and Algorithm 3b use floating-point filters to avoid performing exact
computations when possible.

2.4 Algorithm 4: Randomized Incremental Construction 

Recently, Har-Peled [?] gave an O((n + k)α(n) log n) expected time algorithm
for the on-line construction of the zone, where k is the number of intersections
between γ and the input lines. This improves by almost a logarithmic factor
the application of Overmars and van Leeuwen for this restricted case [?].
The algorithm of [?] relies on a careful simulation of an off-line randomized
incremental algorithm (which uses an oracle) that constructs the zone. We
implemented a somewhat simpler variant, which is similar to Algorithm 3
(Section 2.3), with the difference that we split complex faces so that every internal
node in the face tree T will hold a polygon of constant complexity. For a node v,
let R_v be the polygonal region that corresponds to it (this may be a whole face
in the arrangement, as in Algorithm 3, or part of a face), and let I(v) be its
conflict list, i.e., the list of input lines that intersect R_v.

Computing a leaf of T that contains a given point p is performed by carrying
out a point-location query, as in algorithm 3. We deviate from that algorithm in
the following way: if a face R_v created by the algorithm has more than c vertices
(where c is an arbitrary constant), we split it into two regions by a segment s
that connects two of its vertices (chosen arbitrarily), thus generating two new
children in the tree T, such that the complexity of each of their polygons is at
most ⌈c/2⌉ + 1 (note that, in general, s does not lie on one of the input lines).
Now, a node in T corresponds to a polygon having at most c vertices (a similar
idea was investigated in [?]). This seems to reduce the expected depth of the
tree, which we conjecture to be logarithmic.
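The splitting step can be illustrated with a small sketch. It is not the authors' code: the names `Polygon`, `split_by_diagonal` and `split_until_small` are hypothetical, a polygon is reduced to a list of vertex ids in boundary order, and the diagonal is assumed to connect vertex 0 with the vertex halfway around the boundary (one possible choice; the text leaves the choice open).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Polygon = std::vector<int>;  // vertex ids in boundary order

// Split a polygon with k >= 4 vertices along the diagonal from vertex 0 to
// vertex floor(k/2). The pieces have floor(k/2)+1 and ceil(k/2)+1 vertices.
std::pair<Polygon, Polygon> split_by_diagonal(const Polygon& poly) {
    const std::size_t k = poly.size();
    assert(k >= 4);
    const auto m = static_cast<std::ptrdiff_t>(k / 2);
    Polygon first(poly.begin(), poly.begin() + m + 1);  // v_0 .. v_m
    Polygon second(poly.begin() + m, poly.end());       // v_m .. v_{k-1}
    second.push_back(poly[0]);                          // close at v_0
    return {first, second};
}

// Keep splitting until every piece has at most c vertices (c >= 3).
void split_until_small(const Polygon& poly, std::size_t c,
                       std::vector<Polygon>& leaves) {
    assert(c >= 3);
    if (poly.size() <= c) {
        leaves.push_back(poly);
        return;
    }
    const std::pair<Polygon, Polygon> halves = split_by_diagonal(poly);
    split_until_small(halves.first, c, leaves);
    split_until_small(halves.second, c, leaves);
}
```

For a heptagon and c = 4, for example, the first split yields pieces with 4 and 5 vertices, and the 5-vertex piece is split once more.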

To compute a face, we first compute a leaf v of T that contains our current
point p, as described above. We then find a neighboring leaf that belongs to
the same face by performing a point-location query in the middle of a
splitter lying on the boundary of R_v. By performing a sequence of such queries,
we can compute all the leaves in T that correspond to the face that contains p.
Reconstructing the whole face from those leaves is straightforward.

The walk itself is carried out by computing for each face the point where the 
walk leaves the face, and performing a point-location query in T for this exit 
point, as in algorithm 3. 



Computing the Conflict Lists Using Less Geometry. When splitting a
region R_v into two regions R_v^+ and R_v^-, we have to compute the conflict lists
of R_v^+ and R_v^-. For a line ℓ ∈ I(v), this can be done by computing the intersection
between ℓ and R_v^+, R_v^-, but this is rather expensive. Instead, we do the following:
for each line ℓ in the conflict list of R_v, we keep the indices of the two edges
of ∂R_v that ℓ intersects. As we split R_v into R_v^+ and R_v^-, we compute for each
edge of R_v the edges of R_v^+, R_v^- it is mapped to. Thus, if the line ℓ does not
intersect the two edges of R_v that are crossed by the splitting line, then we can
decide, by merely inspecting this mapping between indices, whether ℓ intersects
R_v^+, R_v^-, or both, and which edges of R_v^+, R_v^- are intersected by ℓ.
However, there are situations where this mapping mechanism is insufficient. In
such cases, we use the geometric predicate SideOfLine, which determines on
which side of a line a given point lies, as illustrated in Figure 4. While there is a
non-negligible overhead in computing the mapping, the above technique reduces
the number of calls to the SideOfLine predicate by a factor of two (the predicate
SideOfLine is an expensive operation when using exact arithmetic, even if
floating-point filtering is applied).
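The index-based classification can be sketched as below, under simplifying assumptions that are not taken from the paper: the region is convex with edges numbered 0..k-1 in boundary order, the splitting line crosses edges p and q (p < q), and the piece bounded by the edges strictly between p and q is called the "plus" side by convention. The names `Side` and `classify` are hypothetical.

```cpp
#include <cassert>

enum class Side { PlusOnly, MinusOnly, Both, NeedsPredicate };

// A line of the conflict list crosses edges i and j of the region.
// If neither of them is one of the two split edges p, q, the answer follows
// from the indices alone; otherwise a geometric test (such as SideOfLine)
// is still required.
Side classify(int k, int p, int q, int i, int j) {
    assert(0 <= p && p < q && q < k);
    assert(0 <= i && i < k && 0 <= j && j < k && i != j);
    if (i == p || i == q || j == p || j == q) return Side::NeedsPredicate;
    const bool i_plus = (p < i && i < q);
    const bool j_plus = (p < j && j < q);
    if (i_plus && j_plus) return Side::PlusOnly;
    if (!i_plus && !j_plus) return Side::MinusOnly;
    return Side::Both;  // the line crosses the splitter itself
}
```

Only the `NeedsPredicate` outcome costs a call to an exact predicate; the other three outcomes are decided by integer comparisons.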




Fig. 4. Computing the conflict lists of R_v^+, R_v^-: The lines ℓ_1, ℓ_2 can be classified
by inspecting their edge indices and the index mapping. Classifying ℓ_3 is done by
inspecting its edge indices and computing SideOfLine of the points p, q (or q, r)
relative to ℓ_3



Optimization. In addition to caching results of computations (see Section 4.3),
massive filtering was performed throughout the program. The lines are represented
both in rational and in floating-point form. The points are represented
as two pointers to the lines whose intersection forms the point. Whenever
the result of a geometric predicate is unreliable (i.e., the floating-point result
lies below a certain threshold), the exact representations of the geometric entities
involved are calculated, and the predicate is recomputed using exact arithmetic.
As a result of this filtering, almost all computations are carried out using
floating-point arithmetic, which gives a speedup by a factor of two with respect
to the standard floating-point filters.

3 Experiments 

In order to test the programs that implement the four algorithms described
in the previous section, we created five input files (see Figure 5). In inputs 1
and 2, the lines form a grid; inputs 3 and 4 consist of random lines; and in
input 5 there is a very complex face to which each input line contributes one
edge. The polyline in inputs 1, 3, and 5 is a segment; in inputs 2 and 4, it consists
of 50 and 35 segments, respectively. We ran the programs on the test suite and
measured average running times, based on 25 executions, for n = 2000; that is,
each input file contained 2000 lines (see Table 1).

Fig. 5. The five inputs whose details, as well as the running times of the algorithms
on them, are given in Table 1. The figures show the input lines (bright lines) and
the polyline (darker). The arrangements are depicted here with only 50 lines, for
reasons of clarity



In general, the best results were achieved by algorithms 3b and 4, whose
running times were considerably faster than those of algorithms 1 and 2. It is
important to note that each algorithm was implemented by a different programmer,
applying different optimization techniques, as will be discussed in Section 4.
Comparing the performance of the programs might therefore be misleading with
regard to the algorithms’ potential performance. Nevertheless, some conclusions
can be drawn from our experiments.

Algorithm 2 is very slow overall, but performs quite well for the first face, as
input 5 demonstrates. Perhaps the reason is that most of the work it performs
in order to compute the first face is done by the function that sorts the dual
points, and this function is relatively fast. As mentioned earlier, the improved
maintenance of the conflict lists reduced the running time of algorithm 3b,
compared to algorithm 3, on input 5. Interestingly, it also performs better on the
other inputs. Algorithm 4 gave similar results to algorithm 3b, but after massive
optimization (Section 2.4) it surpassed it by a factor of up to 3.

Figure 6 shows the performance of algorithm 3b on input 4, for various values of n.
Empirically, it seems that the algorithm computes the first face in an arrangement
of random lines in linear time, and reports each of the following faces
in O(log n) time on average. Similar results were obtained for input 2 (the grid).

Input file                          1         2         3         4         5
# of vertices in the polyline       2        50         2        35         2
# of faces in the zone            667      6289      1411     12044         2
  (without multiplicities)      (667)    (6271)    (1411)   (11803)       (1)
Algorithm 1                    25.896    50.687    15.217   102.905     3.436
Algorithm 2                    69.328    85.128    27.715   232.122     1.049
Algorithm 3                     2.706     7.565     6.017    23.161     5.384
Algorithm 3b                    1.478     5.338     2.205    14.079     1.036
Algorithm 4                     0.482     2.102     0.700     4.334     1.047

Table 1. Summary of the test data and running times of the four algorithms
described in Section 2. All inputs contain 2000 lines. Times are average running times
in seconds, based on 25 executions on a Pentium II 450MHz PC with 512MB RAM




Fig. 6. Running times of algorithm 3b on input 4, a random arrangement, for
various values of n. The figure shows average times, based on 25 executions, for computing
the first face (crosses, in seconds) and the rest of the faces (circles, in milliseconds)



4 Discussion 

In this section, we summarize the major conclusions drawn from our experience, 
especially with regard to optimization techniques. 



4.1 Software Design 

Motivated by the structure of the CGAL basic library [?], in which geometric
predicates are separated from the algorithms (using the traits mechanism [?]),
we restricted all geometric computations in the programs to a small set of geometric
predicates. This enabled us to debug and profile the programs easily, and to
deploy caching, filtering, and exact arithmetic effortlessly. Writing such predicates
is not always trivial, but fortunately LEDA [?] and CGAL [?] provide
such implementations.

4.2 Number Type and Filtering 

To get good performance, one would like to use the standard floating-point arith- 
metic. However, this is infeasible in geometric computing, where exact results 
are required, due to precision limitations. Since our input is composed of lines 
and vertices in rational coordinates, we can perform all computations exactly 
using LEDA rational type. However, LEDA rational suffers from several draw- 
backs: (i) computations are slow (up to 20-40 times slower than floating point), 
(ii) they consume a lot of memory, and (iii) the bit length of the representation 
of the numbers doubles with each operation, which in turn causes a noticeable 
slowdown in program execution time. 

One possible approach to improve the efficiency of the representation of a
LEDA rational number is to normalize the number explicitly. However, this operation
is expensive, and the decision where to perform such normalization is not
straightforward. A different approach is to use filtering [?]. In general, filtering
is a method of carrying out the computations using floating-point (i.e., fast and
inexact) arithmetic, and performing the computations using exact arithmetic
only when necessary. LEDA [?] provides a real number type, which facilitates
such filtered computations (it is not restricted to rational operators or geometric
computations). Additionally, LEDA provides a fine-tuned computational geometry
kernel that performs filtering. Algorithms 1, 3, and 3b used LEDA’s filtered
predicates (e.g., SideOfLine), which resulted in a speedup by a factor of two or
more, compared to the non-filtered rational computations. Furthermore, one can
also implement the filtering directly on the geometric representation of the lines
and points (for example, computing the exact coordinates of the points only
when necessary), as was done in algorithm 4 (see Section 2.4). As mentioned
earlier, this technique saves a considerable amount of exact computation, and
gives an additional speedup by a factor of two. Overall, the use of filtering resulted
in the most drastic improvement in the running times of the programs, and was easy
to implement.
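The filtering idea can be illustrated with a minimal, self-contained sketch of a filtered side-of-line test. It does not use LEDA and is not the code of any of the programs discussed here: the names `Point`, `side_of_line_exact` and `side_of_line_filtered` are hypothetical, the input is assumed to have integer coordinates of absolute value below 2^30 (so that the exact determinant fits into 64 bits), and the error bound is a simple conservative choice rather than a tight one.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Point { std::int64_t x, y; };

const std::int64_t kLimit = std::int64_t(1) << 30;  // assumed coordinate bound

inline bool in_range(Point p) {
    return -kLimit < p.x && p.x < kLimit && -kLimit < p.y && p.y < kLimit;
}

// Exact test: +1 if c lies to the left of the directed line a->b,
// -1 if it lies to the right, 0 if it lies on the line.
int side_of_line_exact(Point a, Point b, Point c) {
    assert(in_range(a) && in_range(b) && in_range(c));
    const std::int64_t det =
        (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    return (det > 0) - (det < 0);
}

// Filtered test: evaluate in floating point first and fall back to the
// exact test only when the result is too close to zero to be trusted.
int side_of_line_filtered(Point a, Point b, Point c,
                          long* exact_calls = nullptr) {
    const double t1 = double(b.x - a.x) * double(c.y - a.y);
    const double t2 = double(b.y - a.y) * double(c.x - a.x);
    const double det = t1 - t2;
    const double bound = 1e-12 * (std::fabs(t1) + std::fabs(t2));
    if (det > bound) return 1;
    if (det < -bound) return -1;
    if (exact_calls) ++*exact_calls;  // optional statistics
    return side_of_line_exact(a, b, c);
}
```

In this sketch the exact branch is reached only for (nearly) collinear triples, which mirrors the observation above that almost all computations can be carried out in floating-point arithmetic.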

4.3 Caching 

One possible approach to avoid repeated exact computations is to cache results of
geometric computations (i.e., intersection of lines, SideOfLine predicate, etc.).
Such a caching scheme is easy to implement, but does not result in a substantial
improvement, especially since the filtered computations might require less time
than the cache lookup. Caching was implemented in algorithms 1 and 4.
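A caching scheme of this kind can be sketched in a few lines. The class name `SideOfLineCache` and the choice of keying results by a (line id, point id) pair are assumptions made for this illustration, not details taken from the implementations.

```cpp
#include <cassert>
#include <map>
#include <utility>

class SideOfLineCache {
public:
    // Return the cached result for (line_id, point_id), or evaluate the
    // (possibly expensive) predicate via compute() and remember its result.
    template <class Compute>
    int get(int line_id, int point_id, Compute compute) {
        const std::pair<int, int> key(line_id, point_id);
        const auto it = table_.find(key);
        if (it != table_.end()) {
            ++hits_;
            return it->second;
        }
        const int result = compute();
        assert(result >= -1 && result <= 1);
        table_.emplace(key, result);
        return result;
    }
    long hits() const { return hits_; }

private:
    std::map<std::pair<int, int>, int> table_;
    long hits_ = 0;
};
```

Each lookup in the ordered map costs a logarithmic number of key comparisons, which illustrates why such a cache may not pay off when the filtered predicate itself is only a handful of floating-point operations.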

4.4 Geometry 

The average number of vertices of a face in an arrangement of lines is about 4
(this follows directly from Euler’s formula). Classical algorithms rely on vertical
decomposition, which has the drawback of splitting all the faces of the
arrangement into vertical trapezoids. An alternative approach is to use
constant-complexity convex polygons instead of vertical trapezoids (as was done in
algorithm 4). The results indicate that this approach is simple and efficient. This
idea was suggested by Matoušek, and was also tested in [?].
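The figure of about 4 vertices per face can be checked with a short count. The derivation below is added for illustration and assumes n lines in general position (no two parallel, no three through a common point); V, E and F denote the numbers of vertices, edges and faces of the arrangement.

```latex
\begin{align*}
V &= \binom{n}{2}, \qquad E = n^{2}, \qquad F = \binom{n}{2} + n + 1, \\
(V + 1) - E + F &= 2
  \quad \text{(Euler's formula, with one extra vertex at infinity)}, \\
\frac{\text{number of vertex--face incidences}}{F}
  &= \frac{4V}{F} = \frac{4n(n-1)}{n^{2}+n+2}
  \longrightarrow 4 \quad (n \to \infty).
\end{align*}
```

Here each line is cut into n edges by the other n - 1 lines, and each vertex is incident to exactly four faces.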

4.5 Miscellaneous 

Additional improvement in running time can be achieved by tailoring predicates
and functions to special cases, instead of using ready-made library ones.
For example, knowing that two segments intersect at a point frees us from
the need to check whether they overlap. Such techniques were used in algorithms
1, 3, and 3b. Significant improvement can also be accomplished by
improving “classical” algorithms for performing standard operations (see, for
example, Section ?).

5 Conclusions 

We have presented four algorithms for on-line zone construction in arrangements 
of lines in the plane. All algorithms were implemented, and we have also pre- 
sented experimental results and comparisons between the algorithms. 

A major question raised by our work is what could be said theoretically about
the on-line zone construction problem. As mentioned in the Introduction, [?]
proposes a near-optimal output-sensitive algorithm based on the Overmars-van
Leeuwen data structure for dynamic maintenance of halfplane intersection. It
would be interesting to implement this data structure and compare it with the
algorithms that we have implemented.

If the number of faces in the zone of the curve γ is small (constant), then
algorithm 3b described in Section 2.3 is guaranteed to run in expected time
O(n log n), since in that case it is an almost verbatim adaptation of the randomized
incremental algorithm for constructing the intersection of halfplanes, whose
running time analysis is given in [?, Section 3.2]. Indeed, it performs very well
on input 5 (Section 3). An interesting open problem is to extend the analysis
of [?] for the case of a single face to the case of the on-line zone construction.

Algorithm 4 is an attempt to implement a practical variant of a theoretical
result giving a near-optimal output-sensitive algorithm for the on-line zone
construction [?] (the result in [?] was motivated by the good results of algorithm 3
and differs from it only slightly but, as mentioned above, in a crucial respect:
it subdivides large faces into constant-size faces). Still, the practical variant
that was implemented has no theoretical guarantee. Can the analysis of [?]
be extended to explain the performance of algorithm 4 as well?

Finally, as mentioned earlier, the algorithms have been implemented inde- 
pendently, and hence there are many factors that influence their running time 
on the test suite beyond the fundamental algorithmic differences. We are cur- 
rently considering alternative measures (such as the number of basic operations) 
that will allow for better comparison of algorithm performance.


Abstract. Planar maps are fundamental structures in computational
geometry. They are used to represent the subdivision of the plane into
regions and have numerous applications. We describe the planar map
package of CGAL, the Computational Geometry Algorithms Library.
We discuss problems that arose in the design and implementation of the
package and report the solutions we have found for them. In particular,
we introduce the two main classes of the design, planar maps and
topological maps, which enable the convenient separation between geometry
and topology. We also describe the geometric traits, which make our
package flexible by allowing it to be used with any family of curves as long
as the user supplies a small set of operations for the family. Finally, we
present the algorithms we implemented for point location in the map,
together with experimental results that compare their performance.



1 Design Overview 

We describe the design and implementation of a data structure for representing 
planar maps. The data structure supports traversal over faces, edges and vertices 
of the map, traversal over a face and around a vertex, and efficient point location. 
Our design was guided by several goals, among them: (i) ease-of-use, for example, 
easy insertion of new curves into the map; (ii) flexibility, for example, the user can 
define his/her own special curves as long as they support a predefined interface; 
(iii) efficiency, for example, fast point location. 

Our representation is based on the Doubly Connected Edge List (DCEL)
structure [?, Chapter 2]. This representation belongs to a family of edge-based
data structures in which each edge is represented as a pair of opposite halfedges.
Our representation supports inner components (holes) inside the faces,
making it more general and suitable for a wide range of applications (e.g.,
geographical information systems). A planar map is not restricted to line segments,
and our package can be used for any collection of bounded x-monotone curves
(including vertical segments).
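A halfedge representation with holes can be sketched as follows. This is a generic textbook-style illustration, not the classes of the package: the types `Halfedge` and `Face` and the function `cycle_length` are hypothetical, and raw pointers are used only to keep the example short.

```cpp
#include <cassert>
#include <vector>

struct Face;

// Each edge is stored as two halfedges with opposite orientations.
struct Halfedge {
    Halfedge* twin = nullptr;  // the oppositely oriented halfedge
    Halfedge* next = nullptr;  // next halfedge along the same boundary cycle
    int target = -1;           // id of the vertex this halfedge points to
    Face* face = nullptr;      // face lying to the left of the halfedge
};

struct Face {
    Halfedge* outer = nullptr;     // one halfedge of the outer boundary
                                   // (nullptr for the unbounded face)
    std::vector<Halfedge*> holes;  // one halfedge per inner component
};

// Traverse a boundary cycle and count its halfedges.
int cycle_length(const Halfedge* start) {
    assert(start != nullptr);
    int n = 0;
    const Halfedge* h = start;
    do {
        ++n;
        h = h->next;
    } while (h != start);
    return n;
}
```

For a single triangle, the bounded face refers to the inner cycle through `outer`, while the unbounded face lists the oppositely oriented cycle among its `holes`.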




Fig. 1. Source and target vertices, and twin halfedges in a face with a hole 



The package is composed of two main classes: CGAL_Topological_map, which
is described in Section 2.1, and CGAL_Planar_map_2, which is described in
Section 2.2.

The design, depicted in Figure 2, follows CGAL’s polyhedron design introduced
in [?]. The bottom layer holds base classes for vertices, halfedges and
faces. Their responsibilities are the actual storage of the incidences, the geometry
and other attributes. Addition of attributes (such as color) by the user is
easily done by defining one’s own bottom-layer classes (possibly by deriving from
the given base classes and adding the attributes).

The next layer, the Dcel layer, is a container² that stores the classes of
the bottom layer and adds functionality for manipulating them and
traversing over them.

The topological map layer adds high-level functions, high-level concepts³ for
accessing the items, i.e., handles, iterators and circulators (unlike in the Dcel layer,
pointers are no longer visible at this interface), and protection of combinatorial
validity.

² We use the term container as it is used in the C++ Standard Template Library,
STL, i.e., a class that stores a collection of other classes.

³ By concepts we mean classes obeying a set of predefined requirements, which can
therefore be used in generic algorithms. For example, our iterators obey the STL
(Standard Template Library [?]) requirements and can therefore be used in STL’s
generic algorithms.



Fig. 2. Design overview

The top layer, the planar map layer, adds geometry to the topological map
layer using a Traits template parameter (Section 3). We provide high-level
functionality based on the geometric properties of the objects. Geometric queries,
namely point location and vertical ray shooting, are also introduced in this class.
We provide three implementations, and a mechanism (the so-called strategy pattern
[?]) for the users to implement their own point location algorithm. The abstract
strategy class CGAL_Pm_point_location_base<Planar_map> is a pure virtual
class declaring the interface between the algorithm and the planar map. The
planar map keeps a reference to the strategy. The concrete strategy is derived
from the abstract class and implements the algorithm interface.
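The strategy mechanism can be illustrated on a deliberately tiny stand-in for a map. Everything in this sketch is hypothetical (`LocationStrategy`, `LinearScan`, `BinarySearch`, `ToyMap`) and is not the interface of CGAL_Pm_point_location_base: the "map" is a subdivision of the real line by sorted cut points, and locating a query means returning the index of the interval that contains it.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Abstract strategy: the interface between the map and the algorithm.
class LocationStrategy {
public:
    virtual ~LocationStrategy() {}
    virtual int locate(const std::vector<double>& cuts, double x) const = 0;
};

// Concrete strategy 1: look at every cut point (no extra storage).
class LinearScan : public LocationStrategy {
public:
    int locate(const std::vector<double>& cuts, double x) const override {
        int cell = 0;
        for (double c : cuts)
            if (c <= x) ++cell;
        return cell;
    }
};

// Concrete strategy 2: exploit the sorted order of the cut points.
class BinarySearch : public LocationStrategy {
public:
    int locate(const std::vector<double>& cuts, double x) const override {
        return int(std::upper_bound(cuts.begin(), cuts.end(), x) - cuts.begin());
    }
};

// The map keeps a reference to the strategy and delegates queries to it.
class ToyMap {
public:
    ToyMap(std::vector<double> cuts, const LocationStrategy& strategy)
        : cuts_(std::move(cuts)), strategy_(&strategy) {
        assert(std::is_sorted(cuts_.begin(), cuts_.end()));
    }
    int locate(double x) const { return strategy_->locate(cuts_, x); }

private:
    std::vector<double> cuts_;
    const LocationStrategy* strategy_;
};
```

Swapping the strategy object changes how queries are answered without touching the map class, which is the point of the pattern.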

We have derived three concrete strategies: CGAL_Pm_default_point_location,
CGAL_Pm_naive_point_location, and CGAL_Pm_walk_point_location, which
is an improvement over the naive one. Each strategy is preferable in different
situations. The default class implements the dynamic incremental randomized
algorithm introduced by Mulmuley [?]. The naive algorithm goes over all the
vertices and halfedges of the planar map. Namely, in order to find the answer to
an upward vertical ray shooting query, we go over all the edges in the map and
find the one that is above the query point and has the smallest vertical distance
from it. Therefore, the time complexity of a query with the naive class is linear
in the complexity of the planar map. The walk algorithm implements a walk over
the zone in the planar map of a vertical ray emanating from the query point.
This decreases the number of edges visited during the query, thus improving the
query time. The main trade-off between the default strategy and the two
other strategies is between time and storage: the naive and walk algorithms
need more time but almost no additional storage. Section 4 describes the point
location classes.
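The naive upward vertical ray shooting query can be sketched for the special case of non-vertical line segments. The sketch uses plain `double` arithmetic, so unlike the exact traits classes it is not robust; `Segment` and `naive_ray_shoot_up` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Segment { double x1, y1, x2, y2; };  // non-vertical, with x1 < x2

// Return the index of the segment hit first by the upward vertical ray
// from (qx, qy), or -1 if no segment lies above the query point.
int naive_ray_shoot_up(const std::vector<Segment>& edges, double qx, double qy) {
    int best = -1;
    double best_y = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < edges.size(); ++i) {
        const Segment& s = edges[i];
        assert(s.x1 < s.x2);
        if (qx < s.x1 || qx > s.x2) continue;  // not in the x-range
        const double t = (qx - s.x1) / (s.x2 - s.x1);
        const double y = s.y1 + t * (s.y2 - s.y1);  // height of s at qx
        if (y > qy && y < best_y) {  // above the query and closest so far
            best_y = y;
            best = int(i);
        }
    }
    return best;
}
```

Every edge is inspected once per query, which is the linear query time mentioned above; no auxiliary search structure is stored.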

2 Planar Maps and Topological Maps 

A planar map has a combinatorial (or topological) component and a geometric 
one. The combinatorial component consists of what we refer to as combinatorial 
objects — vertices, halfedges and faces, and the functionality over them, for 
example — traversal of halfedges around a face, traversal of halfedges around 
a vertex or finding the neighboring face. The geometric component consists of 
geometric information like point coordinates and curve equations, and geometric 
functionality such as point location. We carry out this separation in our design 
with two classes — CGAL_Topological_map and CGAL_Planar_map_2 which is 
derived from it. 



2.1 Topological Map 

The topological map class is meant to be used as a base for geometric subdivi- 
sions (as we have done for 2D planar maps). It consists of vertices, edges and 
faces and an incidence relation on them, where each edge is represented by two 
halfedges with opposite orientations. A face of the topological map is defined 
by the circular sequences of halfedges along its inner and outer boundaries. The
presence of a containment relationship between a face and its inner holes is a
topological characteristic that distinguishes it from standard graph structures
and from other edge-based combinatorial structures. This enables us to derive
subdivisions with holes in them.

CGAL_Topological_map can be used as a base class for deriving different 
types of geometric subdivisions. We also regard it as provisory for implementing 
three-dimensional subdivisions induced by algebraic surfaces, for example, a two- 
dimensional map on a sphere or a polyhedral terrain. It can also be used almost 
as it is as a representation class for polygons with holes, by merely adding point 
coordinates as additional attributes to the vertices. 

The following simple function combinatorial_triangle() demonstrates the
use of CGAL_Topological_map. It creates an empty map (with one face corresponding
to the unbounded face) and then inserts an edge e1 inside the unbounded
face. It then inserts an edge e2 from the target vertex of e1 and finally
inserts an edge between the target vertices of e2 and e1->twin(), closing a
“combinatorial” triangle (i.e., a closed cycle of three vertices without
coordinates).

typedef CGAL_Pm_dcel<CGAL_Tpm_vertex_base, CGAL_Tpm_halfedge_base,
                     CGAL_Tpm_face_base> Dcel;
typedef CGAL_Topological_map<Dcel> Tpm;

void combinatorial_triangle() {
  Tpm t;                                    // empty map: unbounded face only
  Tpm::Face_handle uf = t.unbounded_face();
  Tpm::Halfedge_handle e1 = t.insert_in_face_interior(uf);
  Tpm::Halfedge_handle e2 = t.insert_from_vertex(e1);
  t.insert_at_vertices(e2, e1->twin());     // closes the triangle
}


Addition of attributes such as point coordinates is easy. The following ex- 
ample demonstrates how to add an attribute (in this case some Point type) to 
a vertex of a map. It creates a new vertex type My_vertex that derives from 
CGAL_Tpm_vertex_base and adds the attribute. The new vertex is then passed 
as a template parameter to the Dcel. After the insertion of the new edge the 
information in its incident vertices can be updated by the user. This can be used, 
for example, as a representation class for polygons with holes in them. 

struct My_vertex : public CGAL_Tpm_vertex_base {
  Point pt;   // user-defined attribute
};

typedef CGAL_Pm_dcel<My_vertex, CGAL_Tpm_halfedge_base,
                     CGAL_Tpm_face_base> Dcel;
typedef CGAL_Topological_map<Dcel> Tpm;

void insert_with_info() {
  Tpm t;
  Tpm::Face_handle uf = t.unbounded_face();
  Tpm::Halfedge_handle e1 = t.insert_in_face_interior(uf);
  e1->source()->pt = Point(0,0);
  e1->target()->pt = Point(1,1);
}

2.2 Planar Map 

The CGAL_Planar_map_2 class is derived from CGAL_Topological_map. It represents
an embedding of a topological map T in the plane such that each edge of T
is embedded as a bounded x-monotone curve and each vertex of T is embedded
as a planar point. In this embedding no pair of edges intersect except at their
endpoints.⁴

CGAL_Planar_map_2 adds geometric information to the topological map it is
derived from. The Traits template parameter (Section 3) defines the geometric
types for the planar map and the basic geometric functions on them. The planar
map implements its algorithms in terms of these basic functions.

⁴ We are currently implementing an arrangement layer on top of the planar map
layer, in which the intersection of curves is supported, and in which curves need not
be x-monotone.

The additional geometric information enables geometric queries and an easier
interface that uses the geometry; e.g., the users can insert a curve into the
map without specifying where to insert it, and the map finds the location
geometrically. The modifying functions of the topological map (e.g., remove_edge) are
overridden and are implemented using the geometric information. We have also
overridden the combinatorial insertion functions (e.g., insert_from_vertex) so
that they use the geometric information. If the users have some combinatorial
information (e.g., that one of the endpoints of the new curve is incident on one
of the already existing vertices of the planar map), they can use the specialized
insertion function instead of the more general one, thus making the operation
more efficient.

The following function geometric_triangle() creates an empty map and
then inserts three segments into it, corresponding to a triangle. It resembles the
example of the previous section and shows the difference between the interface
of the topological map and the planar map.

typedef CGAL_Homogeneous<long> Rep;
typedef CGAL_Pm_segment_exact_traits<Rep> Traits;
typedef CGAL_Pm_default_dcel<Traits> Dcel;
typedef CGAL_Planar_map_2<Dcel,Traits> Pm;

void geometric_triangle() {
  Pm p;
  Traits::Point p1(0,0), p2(1,0), p3(0,1);
  Traits::X_curve cv1(p1,p2), cv2(p2,p3), cv3(p3,p1);
  p.insert(cv1);   // the map locates each curve geometrically
  p.insert(cv2);
  p.insert(cv3);
}

2.3 Problems and Solutions 

The separation of the design into a topological layer and a geometric layer creates 
some algorithmic problems. When designing a topological class such as the topo- 
logical map, no geometric considerations can be used. This imposes constraints 
on the functions and function parameters of the topological map. For some func- 
tions (e.g., the split_edge function), the topological information suffices, while 
for others additional topological information is needed to avoid ambiguity. In 
the geometric layer this information can be deduced from the geometric infor- 
mation. In the remainder of this section we present examples of such algorithmic 
problems and their solutions in our design. 



The Use of Previous Halfedges. When inserting a new edge from a vertex v
using the function insert_from_vertex, passing only v to the function will cause
ambiguity. Figure 3 shows an example of two possible edges that can be inserted
if we had only passed the vertex. Topologically, what defines the insertion from
a vertex uniquely is the previous halfedge to the edge inserted. Therefore, this
is what is passed in our topological function. In the geometric layer, passing the
vertex v is sufficient, since we can find the previous halfedge geometrically: we
go over the halfedges around v and find geometrically the halfedge which is the
first halfedge incident to v that is encountered when proceeding from the newly
inserted curve in counterclockwise order.

Fig. 3. Previous halfedges are necessary: Passing only v as a parameter to the
insertion function cannot resolve whether e1 or e2 is the edge to be inserted into
the map
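The geometric search around a vertex can be sketched for straight segments as follows. This is an illustration only: it works with direction vectors and floating-point angles rather than with the exact predicates of a traits class, and the names `Direction` and `first_ccw_from` are hypothetical. Which of the two halfedges of the edge found plays the role of the "previous" halfedge depends on the orientation convention of the data structure and is not modeled here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Direction { double dx, dy; };  // direction of an edge leaving v

// Among the edges already incident to v, return the index of the first one
// encountered when rotating counterclockwise from the new curve's direction.
int first_ccw_from(const std::vector<Direction>& incident, Direction new_dir) {
    assert(!incident.empty());
    const double two_pi = 2.0 * std::acos(-1.0);
    const double base = std::atan2(new_dir.dy, new_dir.dx);
    int best = -1;
    double best_turn = two_pi + 1.0;
    for (std::size_t i = 0; i < incident.size(); ++i) {
        double turn = std::atan2(incident[i].dy, incident[i].dx) - base;
        if (turn <= 0.0) turn += two_pi;  // normalize to (0, 2*pi]
        if (turn < best_turn) {
            best_turn = turn;
            best = int(i);
        }
    }
    return best;
}
```

With edges leaving v towards the east, north and west, a new curve leaving towards the north-east is followed counterclockwise by the northern edge, whereas a new curve leaving towards the south-west is followed by the eastern edge.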




Fig. 4. The need for the move_hole function: After inserting a new edge between
v1 and v2 the original face is split into two faces. Since we have no knowledge of
the geometry of the curves in the map, we cannot define topologically which of
the two faces contains the hole h

The Use of the move_hole Function. When inserting a new edge between
two vertices v1 and v2, a new face might be created. Figure 4 shows an example
of two possible edges that can be inserted into a topological map. If e1 is inserted,
the hole h should be in the right face; if e2 is inserted, the hole h should be in
the left face. The topological map has no geometric knowledge of the curves, so
it cannot determine which of the two faces contains the hole. Therefore, this is
done by the client (the derived class) of the topological map (e.g., the planar
map), which calls the function move_hole if a hole needs to be moved to a new
face created after insertion.

In the planar map class we go over the holes of the original face and check
whether they are inside the new face that was created; if so, the hole is moved to
the new face using the move_hole function. Since our planar map assumes that no
curves intersect, we check whether a hole is inside a face by checking whether one
of its vertices, call it v, is inside the face. This is done using a standard algorithm.
Conceptually, we shoot a ray from v vertically upwards and count the number of
halfedges of the face that intersect it. If the number is odd, the point is inside the
face. In practice, we do not need to implement an intersection function (which
can be expensive for some curves); we just need a function that determines whether
a curve is above or below a point. We then go over the halfedges of the face and use
it to count how many of them are above the vertex.
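The parity test can be sketched for faces bounded by line segments. As before, the sketch uses `double` arithmetic and hypothetical names (`Segment`, `curve_is_above`, `inside_face`); a half-open x-range is assumed so that a boundary vertex directly above the query point is counted once, and vertical segments are skipped because they cannot lie above a point in this sense.

```cpp
#include <cassert>
#include <vector>

struct Segment { double x1, y1, x2, y2; };  // endpoints in either order

// The only geometric test needed: is the curve above the point (px, py)?
// Called only when px lies in the x-range of a non-vertical segment.
bool curve_is_above(const Segment& s, double px, double py) {
    assert(s.x1 != s.x2);
    const double t = (px - s.x1) / (s.x2 - s.x1);
    return s.y1 + t * (s.y2 - s.y1) > py;
}

// A point is inside the face if an odd number of boundary edges lie above it.
bool inside_face(const std::vector<Segment>& boundary, double px, double py) {
    int above = 0;
    for (const Segment& s : boundary) {
        const double lo = s.x1 < s.x2 ? s.x1 : s.x2;
        const double hi = s.x1 < s.x2 ? s.x2 : s.x1;
        if (px < lo || px >= hi) continue;  // half-open range; skips verticals
        if (curve_is_above(s, px, py)) ++above;
    }
    return above % 2 == 1;
}
```

No intersection points are computed; the test relies only on the above/below comparison, in line with the remark that an intersection function is not needed.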



3 Geometric Traits 

The geometric traits class is an abstract interface of predicates and functions that 
wraps the access of an algorithm to the geometric (rather than combinatorial) 
inner representation. 

In the planar map package, we tried to define the minimal geometric interface
that enables the construction and handling of a geometric map. Packing those
predicates and functions under one traits class helped us achieve the following
goals: flexibility in choosing the geometric representation of the objects
(homogeneous, Cartesian); flexibility in choosing the geometric kernel (LEDA,⁵ CGAL
or a user-defined kernel); ability to have several strategies for robustness issues;
extensibility to maps of objects other than line segments.

The documentation of the planar map class gives the precise requirements
that every traits class should obey. We have formulated the requirements so
that they make as few assumptions on the curve as possible (for example, linearity
of the curve is not assumed). This enables the users to define their own traits
classes for different kinds of curves that they need for their applications. The
only restriction is that they obey the predefined interface.

* LEDA: Library of Efficient Data structures and Algorithms,
http://www.mpi-sb.mpg.de/LEDA/leda.html

The first task of the planar map traits is to define the basic objects of the map:
the point (Point) and the x-monotone curve (X_curve). In addition, four types
of predicates are needed: (i) access to the endpoints of the x-monotone curves;
(ii) comparison predicates between points; (iii) comparisons between points and
x-curves (e.g., whether the point is above the curve); (iv) predicates between
curves (e.g., comparing the y-coordinates of two curves at a given x-coordinate).
This interface of four types of predicates satisfies the geometric needs of the
planar map.
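To make the four kinds of predicates concrete, the following is a minimal sketch of a traits class for non-vertical line segments with integer coordinates. The class name, the member names and the return conventions are assumptions made for illustration; the precise requirements are those given in the package documentation, not this sketch.

```cpp
#include <cassert>

// Hypothetical traits sketch for non-vertical line segments with integer
// coordinates. All names are illustrative; this is not the package's API.
struct Segment_traits_sketch {
    struct Point   { long x, y; };
    struct X_curve { Point source, target; };

    // Sign of v as -1, 0 or +1.
    static int sign(long v) { return (v > 0) - (v < 0); }

    // (i) Access to the endpoints of an x-monotone curve.
    static Point leftmost(const X_curve& c) {
        return c.source.x <= c.target.x ? c.source : c.target;
    }
    static Point rightmost(const X_curve& c) {
        return c.source.x <= c.target.x ? c.target : c.source;
    }

    // (ii) Comparison between points: -1, 0, +1 by x-coordinate.
    static int compare_x(const Point& p, const Point& q) {
        return sign(p.x - q.x);
    }

    // (iii) Point versus curve: +1 if p is above c, -1 if below, 0 if on c.
    // Precondition: c is not vertical and p lies in the x-range of c.
    static int point_position(const Point& p, const X_curve& c) {
        const Point l = leftmost(c), r = rightmost(c);
        assert(l.x < r.x && l.x <= p.x && p.x <= r.x);
        return sign((r.x - l.x) * (p.y - l.y) - (r.y - l.y) * (p.x - l.x));
    }

    // (iv) Curve versus curve: compares the y-coordinates of c1 and c2 at x.
    // Precondition: both curves are non-vertical and x is in both x-ranges.
    static int compare_y_at_x(const X_curve& c1, const X_curve& c2, long x) {
        const Point l1 = leftmost(c1), r1 = rightmost(c1);
        const Point l2 = leftmost(c2), r2 = rightmost(c2);
        const long d1 = r1.x - l1.x, d2 = r2.x - l2.x;
        assert(d1 > 0 && d2 > 0);
        // y_i(x) * d_i, computed without division
        const long n1 = l1.y * d1 + (r1.y - l1.y) * (x - l1.x);
        const long n2 = l2.y * d2 + (r2.y - l2.y) * (x - l2.x);
        return sign(n1 * d2 - n2 * d1);
    }
};
```

None of the predicates divides, so with an exact integer type the answers are exact; a traits class for another kind of curve would replace the bodies but keep the same four groups.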

In the current version of the planar map package we implemented the following
traits classes:

— CGAL_Pm_segment_exact_traits<R>: a class for planar maps of line segments
that uses Cgal’s kernel. The R template parameter enables the use
of Cgal’s homogeneous or Cartesian kernel. This class is robust when used
with exact arithmetic.

— CGAL_Pm_leda_segment_exact_traits: also handles segments using exact
arithmetic, but uses LEDA [?] rational points and segments, which makes the
predicates faster. One of the differences that makes this traits class more
efficient is the use of LEDA’s primitive predicates (e.g., orientation), which are
implemented using floating point “filters” that speed up the exact computations.

— Arr_circles_real_traits: a class that introduces circular arcs as x-monotone
curves. It uses the LEDA real number type (to support robust square-root
predicates) for calculation. This traits class is mainly provided for arrangements
of circles but can also be used with planar maps.

4 Point Location 

The point location strategy enables the users to implement their own point 
location algorithm which will be used in the planar map. We have implemented 
three algorithms: (i) a naive algorithm — goes over all the edges in the map to 
find the location of the query point; (ii) an efficient algorithm (the default one) 
— Mulmuley’s randomized incremental algorithm; and (iii) a “walk” algorithm 
that is an improvement over the naive one, and finds the point’s location by 
walking along a line from “infinity” towards the query point. In the following 
sections we describe these algorithms and their implementation. 



4.1 Fast Point Location 



As mentioned above, the default point location is based on Mulmuley’s randomized,
fully dynamic algorithm (see [?]).

We remind the reader that our point location implementation handles general
finite planar maps. The subdivision is not necessarily monotone (each face
boundary is a union of x-monotone chains), nor connected, and possibly contains
holes. In addition, the input may be x-degenerate.



Algorithm Our implementation supports insertions as well as deletions of map
edges, while maintaining an efficient point location query time and linear storage
space. This is achieved by a “lazy” approach that performs an occasional
rebuilding step whenever the internal structure (the history DAG, a directed
acyclic graph) passes predefined thresholds in size or in depth. The rebuilding
step is an option that can be fine-tuned or disabled by the user.
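The rebuilding trigger can be pictured as a small policy object. The sketch below is only an illustration: the struct, its two factors and the particular form of the thresholds (size linear in the number of curves, depth logarithmic) are assumptions, not the thresholds used by the package.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical rebuild policy for a history DAG. The two factors are
// illustrative tuning knobs, not the package's actual thresholds.
struct Rebuild_policy_sketch {
    bool   enabled;       // false disables rebuilding altogether
    double size_factor;   // tolerated DAG nodes per curve in the map
    double depth_factor;  // tolerated DAG depth per log2(number of curves)

    // True if the DAG has outgrown either threshold and should be rebuilt.
    bool needs_rebuild(std::size_t curves, std::size_t dag_size,
                       std::size_t dag_depth) const {
        assert(size_factor > 0.0 && depth_factor > 0.0);
        if (!enabled || curves == 0) return false;
        const double n = static_cast<double>(curves);
        const double size_limit  = size_factor * n;
        const double depth_limit = depth_factor * std::log2(n + 1.0);
        return static_cast<double>(dag_size) > size_limit ||
               static_cast<double>(dag_depth) > depth_limit;
    }
};
```

An update routine would call needs_rebuild after each insertion or deletion and, if it returns true, reconstruct the search structure from the current set of curves.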



Implementation Details Our implementation consists of two structures: an
augmented trapezoidal map and a search structure. The trapezoidal map is a
“uniform” collection of X_trapezoids, where each X_trapezoid corresponds to
a subset of the plane bounded above and below by curves and from the sides by
vertical attachments; see [?, Chapter 6] for more details.

Each X_trapezoid is one of four types: a non-degenerate X_trapezoid, a 
“curve like” X_trapezoid, a “point like” X_trapezoid, and a vertical one. The 
non-degenerate X_trapezoid corresponds to the area inside its geometric bound- 
aries; the “curve like” X_trapezoids correspond to the interior of the input 
curves, with possibly many X_trapezoids corresponding to one curve; the “point 
like” X_trapezoids correspond to the endpoints of the input curves. The vertical 
degenerate X_trapezoids will be discussed later. 

These four types are represented in the same way. Each X_trapezoid stores
information regarding its geometric boundaries: left and right endpoints, bottom
and top bounding curves, and a boundedness bit vector denoting, for example, whether
the left endpoint is infinite or not; its geometric neighborhood: left bottom, left
top, right bottom, right top neighbors; and the node in the search structure that
represents the X_trapezoid. In addition, an X_trapezoid is either active or
inactive. In the beginning we have only one active X_trapezoid representing
the entire plane and no inactive X_trapezoids. As the structure thickens, active
X_trapezoids become inactive and new active ones are created. This is done
while preserving the property that the active X_trapezoids form a subdivision
of the plane (in the sense that their union covers the plane, and they do not
overlap).
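The record described above might look roughly as follows. This is a sketch under assumptions: the type names, the pointer-based layout and the use of an index for the search-structure node are hypothetical, chosen only to mirror the fields listed in the text.

```cpp
#include <cassert>
#include <cstddef>

struct Point_sketch { long x, y; };
struct Curve_sketch { Point_sketch left, right; };

// One bit per side of the trapezoid that extends to infinity.
enum Unbounded_side {
    LEFT_UNBOUNDED = 1, RIGHT_UNBOUNDED = 2,
    BOTTOM_UNBOUNDED = 4, TOP_UNBOUNDED = 8
};

// Hypothetical record mirroring the fields listed in the text.
struct X_trapezoid_sketch {
    // Geometric boundaries; a null pointer goes with an unbounded side.
    const Point_sketch* left;
    const Point_sketch* right;
    const Curve_sketch* bottom;
    const Curve_sketch* top;
    unsigned boundedness;                  // OR of Unbounded_side bits
    // Geometric neighborhood.
    X_trapezoid_sketch* left_bottom;
    X_trapezoid_sketch* left_top;
    X_trapezoid_sketch* right_bottom;
    X_trapezoid_sketch* right_top;
    std::size_t dag_node;                  // index of its search-structure node
    bool active;

    bool is_unbounded(Unbounded_side side) const {
        return (boundedness & static_cast<unsigned>(side)) != 0;
    }
    // Marks the trapezoid as replaced by newer ones.
    void deactivate() { assert(active); active = false; }
};

// The initial state: a single active trapezoid representing the whole plane.
X_trapezoid_sketch whole_plane() {
    X_trapezoid_sketch t = X_trapezoid_sketch();   // zero-initializes all fields
    t.boundedness = LEFT_UNBOUNDED | RIGHT_UNBOUNDED |
                    BOTTOM_UNBOUNDED | TOP_UNBOUNDED;
    t.active = true;
    return t;
}
```

Keeping the boundedness bits in the record means that a comparison against an unbounded side is a flag test rather than a comparison with coordinates.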

The active non-degenerate X_trapezoids at any stage are independent of 
the order in which the map was updated, that is, any update order generates 
the same decomposition. The invariant decomposition is called a vertical de- 
composition or a trapezoidal decomposition. The planar map induced by the 
decomposition is also known as a trapezoidal map. 

In contrast, the search structure, also referred to as the history DAG, is
dependent on the update order. The inner nodes are the “curve like”, “point like”
and non-active nodes, while the leaves are the currently active non-degenerate
and degenerate nodes.

We separate geometric from combinatorial data: users can supply their own
search structure as a parameter when instantiating an
object of the CGAL_Trapezoidal_decomposition class. The default is a specially
designed class called CGAL_Pm_DAG.



x-Degeneracies There is an inherent difficulty in working with a vertical
decomposition when the input is x-degenerate, for example, when two points have the
same x-coordinate. The difficulty arises from the definition, as the vertical
attachments are assumed to be disjoint in their interiors. The algorithm solves this
by using a symbolic shear transformation as shown in [?, Section 6.3]. This
solution involves creating new nodes in the history DAG with corresponding vertical
X_trapezoids of zero area.

Consider, for example, a vertical segment s inserted into the planar map.
This segment corresponds to a degenerate vertical X_trapezoid in the search
structure. It is treated as if it were non-degenerate by performing a symbolic
shear transform on the input, $(x, y) \mapsto (x + \epsilon y, y)$, for $\epsilon > 0$ so small
($\epsilon < r$ for every positive real $r$) that the order on the x-coordinates is linear.
This scheme allows us to distinguish between the vertical attachments bounding
the segment s.
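Since $\epsilon$ is smaller than any positive real, the sheared x-order never has to be computed numerically: it is the lexicographic order on (x, y). A minimal sketch, with hypothetical type and function names and integer coordinates assumed:

```cpp
#include <cassert>

struct Shear_point { long x, y; };

// Compares the x-coordinates of p and q after the symbolic shear
// (x, y) -> (x + eps*y, y): returns -1, 0 or +1. Since eps is smaller than
// any positive real, y only matters when the original x-coordinates tie.
int compare_x_sheared(const Shear_point& p, const Shear_point& q) {
    if (p.x != q.x) return p.x < q.x ? -1 : 1;
    if (p.y != q.y) return p.y < q.y ? -1 : 1;
    return 0;
}

// True if the two vertical attachments through the endpoints of a vertical
// segment are distinguishable after the shear.
bool attachments_distinct(const Shear_point& bottom, const Shear_point& top) {
    assert(bottom.x == top.x);           // the segment is vertical
    return compare_x_sheared(bottom, top) != 0;
}
```

For the vertical segment from (3, 1) to (3, 5) the lower endpoint comes first in the sheared order, so the two attachments bound a zero-area trapezoid with a well-defined left and right side.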

Bounding Box A bounding box is a standard tool for dealing with
infinite objects. The idea is to keep the ordinary data structures used by the
algorithm alongside an additional data structure called a bounding box that
contains in its interior all the “interesting” points of the map (e.g., segments’
endpoints). This enables us to deal with infinite objects as if they were finite.
In our framework the infinite objects are the active X_trapezoids representing
infinite portions of the plane, and the interesting points are the endpoints of the
input curves as well as the intersections of the vertical attachments emanating
from them with the curves.

Consider a vertical decomposition algorithm that uses a bounding box. At the 
beginning we have a unique active X_trapezoid representing a fixed bounding 
box. Whenever the property that all the interesting points are in the current 
bounding box becomes invalid the bounding box is enlarged to remedy this. This 
is done before the update or point location query takes place. 

As a result, all operations can take up to $O(n)$ time, where n is the number of
already inserted x-monotone curves. This is due to the possibly large (up to
linear) number of X_trapezoids that are resized in a single bounding box update.
The way we overcome this problem is by using a symbolic representation of
infinity instead of a bounding box. For each X_trapezoid we keep the geometric
information about its boundedness.

Analysis The algorithm offers on-line insert_edge, delete (remove_edge),
split_edge and merge_edge operations with expected time of $O(\log^2 n)$, where
n is the number of update operations (insertions and deletions of curves), under
the model suggested by Mulmuley [?]. The expected storage requirement is
$O(n)$.

It was also shown that the update time is dominated by the point location
part of the operation, and that other than the point location each update takes
expected $O(1)$ time.

4.2 Walk-Along-a-Line 

Unlike the naive strategy, which traverses all the edges of the map, the walk
strategy starts with the unbounded face as the current face (unless the map is
empty and we are done) and finds the hole that contains the query point q. If
no such hole exists we know that q belongs to the current face. Otherwise
we find the halfedge on the current face boundary that is vertically closest to q.
If the face incident to this halfedge does not contain q, we “walk” towards the
query point q along the vertical ray emanating from q, passing from a face to
its adjacent neighbor along this ray, until the face containing the query point is
located. This simple procedure is performed recursively until we cannot continue.
It follows that the last face in the procedure is the required one.

Unlike other implementations of the walk algorithm (see, for example, [?]),
the planar map does not necessarily deal with line segments. Our implementation
assumes x-monotone curves, and does not assume an intersection predicate
between lines and an x-monotone curve (this predicate is not part of the traits
requirements). This is why we need to walk specifically along a vertical ray. For
this reason we cannot store a starting point on the unbounded face from which
to start the walk, but need to find one that is vertically above the query point
for every query. Since our package supports holes, the query point can be in a
hole inside a face. We would not like to walk over all the holes inside the faces
we encounter, only through the ones that contain the point. This is achieved in
our algorithm as described in the previous paragraph: we go over the holes
of a face f only if we know that the query point is contained inside the outer
boundary of f.
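The core step of the walk, finding the curve that is vertically closest above q, needs only the two predicate kinds "curve above point" and "compare two curves at a given x". The following sketch shows this for non-vertical segments with integer coordinates over a plain list of curves; the names are hypothetical, and iterating over a face boundary instead of a vector is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Walk_point   { long x, y; };
struct Walk_segment { Walk_point l, r; };   // non-vertical, l.x < r.x

// True if the vertical line through q meets s.
bool in_x_range(const Walk_segment& s, const Walk_point& q) {
    return s.l.x <= q.x && q.x <= s.r.x;
}

// True if s passes strictly above q (q must be in the x-range of s).
bool curve_is_above(const Walk_segment& s, const Walk_point& q) {
    const long dx = s.r.x - s.l.x;
    assert(dx > 0 && in_x_range(s, q));
    // (y of s at q.x minus q.y), scaled by the positive factor dx
    return (s.l.y - q.y) * dx + (s.r.y - s.l.y) * (q.x - s.l.x) > 0;
}

// True if a is strictly below b at abscissa x (x in both x-ranges).
bool lower_at_x(const Walk_segment& a, const Walk_segment& b, long x) {
    const long da = a.r.x - a.l.x, db = b.r.x - b.l.x;
    assert(da > 0 && db > 0);
    const long ya = a.l.y * da + (a.r.y - a.l.y) * (x - a.l.x);  // y_a(x) * da
    const long yb = b.l.y * db + (b.r.y - b.l.y) * (x - b.l.x);  // y_b(x) * db
    return ya * db < yb * da;
}

// Index of the curve first hit by the upward vertical ray from q, or -1.
int first_curve_above(const std::vector<Walk_segment>& curves,
                      const Walk_point& q) {
    int best = -1;
    for (std::size_t i = 0; i < curves.size(); ++i) {
        if (!in_x_range(curves[i], q) || !curve_is_above(curves[i], q)) continue;
        if (best < 0 || lower_at_x(curves[i], curves[best], q.x))
            best = static_cast<int>(i);
    }
    return best;
}
```

No intersection between a line and a curve is ever constructed, which is the reason for restricting the walk to a vertical ray.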

Whenever we use exact traits the computation is robust. Nevertheless, we pay 
special attention to degenerate cases such as walking over a vertical segment or 
having the query point on a vertical segment. 

Remark: Because the walk class has no internal data structures, it
is perfectly suited for debugging purposes. Users writing their own strategies
can maintain the walk strategy alongside their own class and use it for
validity verifications.

4.3 Experimental Results 

Figure 5 shows the construction time* of a planar map defined by segments with
different traits classes using the default strategy. The input was constructed using
a sample of random segments with integer endpoints which were intersected
using exact arithmetic. Then a conversion to the different number types (e.g.,
the built-in double) was done. It can be seen that the choice of traits strongly
influences the running time. As expected, construction with floating point
arithmetic is fastest, and the LEDA traits (see Section 3) perform very well compared
with a Cartesian representation with a rational number type (in this case the
leda_rational type**). [?] suggests that using other representations such as
homogeneous coordinates with the leda_integer number type or Cartesian with
the leda_real number type (or other filtered number types) will give better
results. It should be noted that on some of the larger inputs the program crashed
when using floating point arithmetic (as noted earlier, we do not guarantee
robustness or correctness when using floating point arithmetic); therefore we
did not compare construction of a map with more than 4000 segments in this
experiment.

Fig. 5. Construction time with different traits (horizontal axis: number of segments)

* All experiments were done on a Pentium-II 450MHz PC with 528MB RAM,
under Linux.

** The reader should not confuse the leda_rational number type with LEDA’s rational
geometry kernel, which uses a different representation, employs floating point filters,
and is used by us in the LEDA traits.

Figure 6 shows the construction time with different point location strategies.
We used the LEDA traits (see Section 3) for the experiments to maintain
robustness with an acceptable running time. It can be seen that the point
location query in the insertion function dominates the construction time. Therefore,
the default strategy has a much faster construction time than the naive and walk
strategies. We do not show the results for the naive strategy beyond 4000
segments because the running time becomes very large. It can also be seen that there
is no noticeable difference in the construction times when we disabled the rebuild
option. This depends, of course, on the thresholds used for the rebuild option.
We plan to experiment more with different thresholds for the default strategy.

Figure 7 shows the average deletion time with different point location strategies,
using the LEDA traits. We can see that the walk strategy performs this operation
efficiently (as does the naive strategy, which is not shown in the figure). Since
they have no internal search structure to update, time is spent only on updating
the topological map after the deletion. For small faces this takes constant time.
The default strategy (with and without the rebuild option) needs to update its
internal structure in addition to updating the topological map. Therefore its
deletion time is considerably greater. Without the rebuilding option, the average
deletion times of the default strategy are slightly better, because with the option
enabled some rebuilding steps take place, which affect the average deletion time.
The average deletion time is greater for the smaller inputs because more edges that
were incident to the unbounded face (which has high complexity) were deleted.
Since the deletion time depends on the complexity of the faces incident to the
deleted edge, this increases the average deletion time.

Fig. 6. Construction time with different strategies using the LEDA traits; the
graphs for the default strategies with and without the rebuild option are almost
overlapping

Fig. 7. Average deletion time with different strategies using the LEDA traits
Abstract. Most geometric algorithms are formulated under the
non-degeneracy assumption, which usually does not hold in practice. When
implementing such an algorithm, a treatment of degenerate cases is
necessary to prevent incorrect outputs or crashes. One way to overcome
this nontrivial task is to use perturbations. In this paper we describe a
generic implementation of efficient random linear perturbations within
Cgal and discuss its practicality by examining the convex hull
problem, line segment intersection, and Delaunay triangulation.



1 Introduction 

Implementing geometric algorithms is a difficult and error-prone task [14]. One
reason for this is that most of the existing algorithms are described for
non-degenerate input to simplify presentation. However, with input data from real
world applications or random input, degenerate cases are very likely to occur.
When implementing such algorithms, we are faced with the problem of identifying
and treating degenerate cases, which leads to additional coding and often lets the
structure of the program deteriorate. If one simply doesn’t care about degenerate
cases, one is often faced with incorrect output or crashes. Another approach to
dealing with degeneracies, which is often used in papers to state that the result also
holds for general inputs, is the method of perturbation, which suggests moving
the input by an infinitesimal amount such that degeneracies are removed. More
or less general perturbation methods that have been proposed are Edelsbrunner
and Mücke’s Simulation of Simplicity scheme (SOS) [11, 17], Yap’s symbolic
scheme [23, 22], the efficient linear scheme of Canny and Emiris [6, 7], the
randomized scheme of Michelucci [16] and the scheme for Delaunay triangulations
of Alliez et al. [1]. For an excellent survey consult the paper of Seidel [21].

The question of the usability of the perturbation approach divides the
computational geometry community, and opinions have changed over the years. In the late
eighties Yap called perturbations “a theoretical paradise in which degeneracies
are abolished”, and Edelsbrunner and Mücke believed that their SOS scheme
would become standard in geometric computing. In the mid nineties, driven by
increasing experience in the implementation of geometric algorithms, more and
more opinions questioned the practical use of perturbations [5, 20]. Burnikel et
al. [5] argued that a direct, carefully considered treatment of degenerate cases is much more
efficient in terms of runtime and only moderately more complicated in terms of
additional coding. They argue that this results from the overhead of computing
with perturbed objects and the need for a nontrivial postprocessing step to recover
the original unperturbed solution. Additional overhead may result from
output-sensitive algorithms like most algorithms for segment intersection. Considering
n input segments that all intersect in one point, the perturbed version would have
to detect $n(n-1)/2$ intersections. We come back to these objections in our
experimental section.

In this paper we describe a generic implementation of random linear
perturbations (based on [21]) within the computational geometry software library
Cgal [8, 12]. It enables the user to perturb the input objects and hence to
code only the original algorithm without bothering about degeneracies.
In contrast to previous implementations of perturbation schemes [17, 7], this is
the first general and easy-to-use implementation; it requires perturbing only the
input rather than each test function.

After briefly reviewing the theoretical framework, we discuss our generic
implementation of perturbations and finally illustrate and evaluate its practical use by
examining three basic examples in computational geometry: convex hulls, line
segment intersection and Delaunay triangulation.

2 An Efficient Linear Perturbation Scheme 

In this section we will briefly review the efficient linear perturbation scheme as 
discussed in [21]. Proofs are omitted due to lack of space and can be found in [21] 
or [9]. 

The goal of the perturbation method is the following: Given a program for 
a certain problem that works correctly for all non-degenerate inputs, we want 
a purely syntactical transformation into a program that works correctly for all 
possible inputs. 

More formally speaking: We are given a function F (describing an algorithm)
from some input space I to some output space O. We will think of I as $\mathbb{R}^N$ (n
points in $\mathbb{R}^d$ if we take $N = dn$) and of O as $\mathbb{R}^M$, as is the case in most
geometric applications.

Definition 1. For an input $q \in \mathbb{R}^N$, a linear perturbation of q is a linear curve
$\pi_q$ starting in q. It is a continuous mapping $\pi_q : [0, \infty) \to \mathbb{R}^N$ with
$\pi_q(\epsilon) = q + \epsilon a_q$, where $0 \neq a_q \in \mathbb{R}^N$. A linear perturbation scheme Q assigns
to each input $q \in \mathbb{R}^N$ a linear perturbation $\pi_q$.

Definition 2. A linear perturbation scheme Q induces for every function
$F : \mathbb{R}^N \to \mathbb{R}^M$ a perturbed function $\overline{F}^Q : \mathbb{R}^N \to \mathbb{R}^M$, defined by
$\overline{F}^Q(q) = \lim_{\epsilon \to 0^+} F(\pi_q(\epsilon))$.

We will assume that this limit exists and sometimes just write $\overline{F}$ for the
perturbed function. When do $\overline{F}$ and $F$ agree?

Lemma 1. If F is continuous at q, then $\overline{F}(q) = F(q)$.

The lemma gives us a simple condition. However, if F is not continuous at
q, we hope that there is some reasonable relationship between $\overline{F}(q)$ and $F(q)$,
because the whole idea of the perturbation method is to compute $\overline{F}(q)$ instead
of $F(q)$, thus being able to neglect degenerate cases.

Consider two examples: the problem of computing the convex hull area
(CHA) and the convex hull sequence (CHS) of a set of points. CHA is continuous
everywhere, so the result of the perturbed program $\overline{CHA}$ is always identical to
the original result. Unfortunately this is not the case for CHS, which is
discontinuous for inputs with more than two collinear points on an edge of the convex
hull (see Fig. 1). However, in this case $CHS(q)$ is a subsequence of $\overline{CHS}(q)$, thus
the original output can be recovered quite easily. Not all discontinuous functions
admit such an easy postprocessing step. We will come back to that problem in
the experimental section.
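For CHS, the easy postprocessing step amounts to deleting, from the hull sequence computed on the perturbed input, every vertex that coincides with its predecessor or lies on the segment between its two neighbors, judged by the original coordinates. A minimal sketch, assuming integer coordinates, a counterclockwise cyclic sequence and a hull that is not entirely collinear; the names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Hull_point { long x, y; };

bool same_point(const Hull_point& a, const Hull_point& b) {
    return a.x == b.x && a.y == b.y;
}

// Twice the signed area of triangle (a, b, c); zero iff collinear.
long orientation(const Hull_point& a, const Hull_point& b, const Hull_point& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Recovers CHS(q) from the hull sequence of the perturbed input, given with
// the original (unperturbed) coordinates of its vertices.
std::vector<Hull_point> strip_degenerate_vertices(const std::vector<Hull_point>& hull) {
    assert(!hull.empty());
    // Step 1: drop cyclically consecutive duplicates.
    std::vector<Hull_point> distinct;
    for (std::size_t i = 0; i < hull.size(); ++i)
        if (distinct.empty() || !same_point(distinct.back(), hull[i]))
            distinct.push_back(hull[i]);
    while (distinct.size() > 1 && same_point(distinct.front(), distinct.back()))
        distinct.pop_back();
    if (distinct.size() < 3) return distinct;
    // Step 2: drop vertices collinear with their two cyclic neighbors.
    std::vector<Hull_point> result;
    const std::size_t n = distinct.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Hull_point& prev = distinct[(i + n - 1) % n];
        const Hull_point& next = distinct[(i + 1) % n];
        if (orientation(prev, distinct[i], next) != 0)
            result.push_back(distinct[i]);
    }
    return result;
}
```

On a six-point input in which the third point lies on the hull edge between the second and the fourth, the perturbed sequence has six vertices and the stripped sequence has five.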
Fig. 1. CHS is discontinuous. First panel: CHS with degenerate input, output
$(q_1, q_2, q_4, q_5, q_6)$. Second panel: $\overline{CHS}$ with the same input, output
$(q_1, q_2, q_3, q_4, q_5, q_6)$.



Now we turn to the computation of $\overline{F}^Q$. To this end we first choose a
model of computation. A geometric algorithm A can be viewed as a decision
tree¹ T where the decision nodes test the sign (+, −, or 0) of some test function
(usually a low-degree polynomial, as in geometric primitives such as the orientation
test or the in-circle test) of the input variables. An input $q \in \mathbb{R}^N$ is called degenerate if
the computation of A on input q contains a test with outcome zero.

Assume now that we have a decision tree T that computes some function F.
How is it possible to compute the perturbed function $\overline{F}^Q$ for some perturbation
scheme Q? We simply perform a perturbed evaluation of T, that is, at every
decision node with test function f we branch according to
$\lim_{\epsilon \to 0^+} \operatorname{sign} f(\pi_q(\epsilon))$ instead of $\operatorname{sign} f(q)$.

Definition 3. Let $f : \mathbb{R}^N \to \mathbb{R}$ be continuous, let $q \in \mathbb{R}^N$, and let $\pi_q$ be a
perturbation of q. We say $\pi_q$ is valid for f iff $\lim_{\epsilon \to 0^+} \operatorname{sign} f(\pi_q(\epsilon)) \neq 0$.
A perturbation scheme Q is valid for f iff $\pi_q$ is valid for f for each $q \in \mathbb{R}^N$. If
$\mathcal{F}$ is a family of test functions, then we call a perturbation (scheme) valid if it
is valid for each $f \in \mathcal{F}$.

¹ A precise definition can be found in [21].

Fig. 2. A valid and a non-valid perturbation of three collinear points.



This means that we don’t have to care about the 0-branches if we have a
valid perturbation. The following theorem encompasses all benefits:

Theorem 1. Let T be a correct decision tree computing some function
$F : \mathbb{R}^N \to O$, and let Q be a perturbation scheme that is valid for the set of test
functions appearing in T.

1. A perturbed evaluation of T computes the perturbed function $\overline{F}^Q$.

2. If F is continuous at q, then the perturbed evaluation of T with input q yields
$F(q)$.

3. The above statements remain true if some, or all, of the 0-branches of T
are removed.

Now it remains to show how we actually get valid perturbations for each possible
input and how to evaluate the perturbed test functions. Linear perturbations
are interesting since they allow a relatively simple evaluation of perturbed test
functions:

Theorem 2. Let f be a multivariate polynomial of total degree at most D, and
let $B_f$ be a “black box algorithm” computing f. Let $\pi_q$ be a linear perturbation
of q that is valid for f. Then $\lim_{\epsilon \to 0^+} \operatorname{sign}(f(\pi_q(\epsilon)))$ can be determined using at
most $D + 1$ calls to $B_f$ plus a small overhead.
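One way to see where the D + 1 calls come from: along the perturbation, $g(t) = f(q + t a_q)$ is a univariate polynomial of degree at most D, so D + 1 values determine its coefficients, and the limit sign is the sign of the lowest-order nonzero coefficient. The sketch below works this out for D = 2 with the orientation test as the black box. It is an illustration under assumptions (fixed degree 2, integer inputs, sample points t = 0, 1, 2, hypothetical names), not the evaluation strategy of [21].

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

typedef std::vector<long> Input;
typedef std::function<long(const Input&)> Black_box;

// Example black box: orientation of (x1,y1), (x2,y2), (x3,y3), stored as
// (x1, y1, x2, y2, x3, y3). It is a polynomial of total degree D = 2.
long orientation_box(const Input& q) {
    return (q[2] - q[0]) * (q[5] - q[1]) - (q[3] - q[1]) * (q[4] - q[0]);
}

int sign_of(long v) { return (v > 0) - (v < 0); }

// f evaluated at q + t*a.
long evaluate_on_line(const Black_box& f, const Input& q, const Input& a, long t) {
    Input p(q);
    for (std::size_t i = 0; i < p.size(); ++i) p[i] += t * a[i];
    return f(p);
}

// Sign of f(q + eps*a) for eps -> 0+, for f of total degree at most 2.
// g(t) = f(q + t*a) = c0 + c1*t + c2*t^2 is recovered from g(0), g(1), g(2),
// i.e. D + 1 = 3 black-box calls; the answer is the sign of the first
// nonzero coefficient. A return value of 0 means the perturbation is not valid.
int limit_sign_degree2(const Black_box& f, const Input& q, const Input& a) {
    assert(q.size() == a.size());
    const long g0 = evaluate_on_line(f, q, a, 0);
    if (g0 != 0) return sign_of(g0);          // non-degenerate: one call suffices
    const long g1 = evaluate_on_line(f, q, a, 1);
    const long g2 = evaluate_on_line(f, q, a, 2);
    const long twice_c1 = 4 * g1 - g2;        // with c0 = 0: 2*c1 = 4*g(1) - g(2)
    if (twice_c1 != 0) return sign_of(twice_c1);
    return sign_of(g2 - 2 * g1);              // with c0 = 0: 2*c2 = g(2) - 2*g(1)
}
```

For the collinear points (0,0), (1,1), (2,2), moving only the last y-coordinate gives a positive limit sign, while the direction (1,0,0,0,0,1) cancels the linear term and the sign is decided by the quadratic coefficient. A pure translation such as (1,1,1,1,1,1) is not valid and yields 0.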

We are left with the problem of how to actually come up with a valid perturbation
for all inputs. It turns out that choosing the perturbation direction $a_q$ randomly
results in a valid perturbation for all inputs with very high probability, as this
result in [21] suggests:

Theorem 3. Let T be a decision tree with a set $\mathcal{F}$ of S different test functions,
each a multivariate polynomial of total degree at most D, and let $q \in \mathbb{R}^N$ be a
fixed input to T. If direction a is chosen uniformly at random from $\{1, 2, \ldots, m\}^N$,
then the linear perturbation $\pi_q(\epsilon) = q + \epsilon a$ fails to be valid with probability at
most $DS/m$.




An Easy to Use Implementation of Linear Perturbations within Cgal 






Now we know how to come up with a valid perturbation of the input. Of course,
if a randomly chosen direction a turns out to be bad, i.e., if during the evaluation
of T a 0-branch is taken, we have to abort the computation and restart with a
new randomly chosen a.
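The abort-and-restart rule can be wrapped once around any perturbed algorithm. The following sketch assumes that a 0-branch is reported by throwing an exception; the exception type, the function name and the seed handling are hypothetical and only illustrate the control flow.

```cpp
#include <cassert>
#include <functional>
#include <random>
#include <stdexcept>

// Hypothetical exception signalling that a perturbed test came out zero.
struct Zero_branch : std::runtime_error {
    Zero_branch() : std::runtime_error("0-branch taken: perturbation not valid") {}
};

// Calls algorithm(seed) with fresh random seeds until it finishes without
// hitting a 0-branch. Gives up by rethrowing after max_attempts failures.
template <class Result>
Result run_with_restarts(const std::function<Result(unsigned)>& algorithm,
                         unsigned max_attempts,
                         unsigned* attempts_used = nullptr) {
    assert(max_attempts > 0);
    std::mt19937 seed_source(20240501u);   // fixed here only for repeatability
    for (unsigned attempt = 1; ; ++attempt) {
        try {
            Result result = algorithm(static_cast<unsigned>(seed_source()));
            if (attempts_used) *attempts_used = attempt;
            return result;
        } catch (const Zero_branch&) {
            if (attempt == max_attempts) throw;   // every direction was bad
        }
    }
}
```

By Theorem 3 a single attempt fails with probability at most DS/m, so with m close to MAXINT more than one restart should be rare; the attempt limit only guards against a test that is zero for every direction, such as comparing a point with a copy of itself.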

A deterministic construction of a valid linear perturbation seems to be rather 
difficult for the general case [7, 21]. 

3 An Implementation of Perturbations within Cgal

In this section we describe how we implemented the perturbation approach
of the last section within Cgal. The solution is surprisingly simple since Cgal
uses the generic programming paradigm. All geometric objects of the Cgal
kernel are parameterized by a number type NT. Hence, we developed a new number
type CGAL_Epsilon_polynomial<NT> representing the linear perturbations while
offering polynomial arithmetic² over NT. This frees us from identifying and
perturbing every test function³, since now all computations and comparisons are
perturbed. The random perturbation direction (a random number between 1
and MAXINT) is assigned to the components of a geometric object p by calling
the function CGAL_perturb(p).




Fig. 3. Computation dag for $x \cdot y - v \cdot w$ if x is the ε-polynomial $5 + 3\epsilon$, y is $3 + 8\epsilon$,
v is $6 + \epsilon$, and w is $4 + 7\epsilon$.



The ε-polynomials are represented with a vector of their coefficients. Initially
we have linear polynomials with the original coordinate as constant coefficient
and the random perturbation direction as linear coefficient. The sign of an
ε-polynomial is the sign of the first nonzero coefficient.

² Polynomial addition, subtraction and multiplication as well as scalar operations are
available (division is avoided in Cgal using CGAL_Quotient<NT>).

³ A shortcoming of previous methods and implementations [11, 17, 7].
Since the sign of an ε-polynomial is often determined by the constant
coefficient or by low degree coefficients, we chose to pursue a lazy evaluation approach
by recording its computation history in a directed acyclic graph and
determining non-constant coefficients only once they are needed from the computation
dag. An arithmetic operation simply constructs a new node in the graph
representation, computes the new constant coefficient, determines the new degree,
establishes pointers to subexpressions and labels the node with the type of the
arithmetic operation (see Fig. 3). Comparisons are reduced to sign
computations. The sign is determined by first looking at the (nonperturbed) constant
coefficient. If this is zero, the linear coefficient is determined recursively from
the subexpressions⁴, and so on until we have reached a nonzero coefficient or
the maximum degree of the polynomial. In that sense, the sign computation is
adaptive: easy sign computations are fast, difficult ones are more expensive. If
the leading coefficient is also zero, we have the case of a non-valid perturbation
scheme, thus we throw an exception. This enables the user to catch it at runtime
and restart the computation with new random perturbations.
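The lazy scheme just described can be sketched in a few dozen lines. The class below is an illustration under assumptions, not the CGAL_Epsilon_polynomial implementation: it fixes the coefficient type to long, supports only addition, subtraction and multiplication, and all names (including the exception type) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical exception for a non-valid perturbation (all coefficients zero).
struct Invalid_perturbation : std::runtime_error {
    Invalid_perturbation() : std::runtime_error("all coefficients are zero") {}
};

class Lazy_eps_sketch {
    enum Op { LEAF, ADD, SUB, MUL };
    struct Node {
        Op op;
        std::size_t degree;
        std::shared_ptr<Node> lhs, rhs;   // subexpressions (null for leaves)
        std::vector<long> known;          // coefficients computed so far
    };
    std::shared_ptr<Node> node_;

    // Coefficient of eps^k, computed from the subexpressions on first use
    // and cached in the node afterwards.
    static long coefficient(Node& n, std::size_t k) {
        if (k > n.degree) return 0;
        while (n.known.size() <= k) {
            const std::size_t j = n.known.size();
            long c = 0;
            if (n.op == ADD) c = coefficient(*n.lhs, j) + coefficient(*n.rhs, j);
            if (n.op == SUB) c = coefficient(*n.lhs, j) - coefficient(*n.rhs, j);
            if (n.op == MUL)
                for (std::size_t i = 0; i <= j; ++i)
                    c += coefficient(*n.lhs, i) * coefficient(*n.rhs, j - i);
            n.known.push_back(c);
        }
        return n.known[k];
    }

    // An arithmetic operation only creates a node and its constant coefficient.
    static Lazy_eps_sketch combine(Op op, const Lazy_eps_sketch& a,
                                   const Lazy_eps_sketch& b) {
        Lazy_eps_sketch r;
        r.node_ = std::make_shared<Node>();
        r.node_->op = op;
        r.node_->lhs = a.node_;
        r.node_->rhs = b.node_;
        r.node_->degree = (op == MUL) ? a.node_->degree + b.node_->degree
                                      : std::max(a.node_->degree, b.node_->degree);
        coefficient(*r.node_, 0);
        return r;
    }
    Lazy_eps_sketch() {}

public:
    // The linear polynomial value + direction * eps.
    Lazy_eps_sketch(long value, long direction) : node_(std::make_shared<Node>()) {
        node_->op = LEAF;
        node_->degree = 1;
        node_->known.push_back(value);
        node_->known.push_back(direction);
    }
    friend Lazy_eps_sketch operator+(const Lazy_eps_sketch& a, const Lazy_eps_sketch& b) {
        return combine(ADD, a, b);
    }
    friend Lazy_eps_sketch operator-(const Lazy_eps_sketch& a, const Lazy_eps_sketch& b) {
        return combine(SUB, a, b);
    }
    friend Lazy_eps_sketch operator*(const Lazy_eps_sketch& a, const Lazy_eps_sketch& b) {
        return combine(MUL, a, b);
    }

    // Sign of the first nonzero coefficient; throws if there is none.
    int sign() const {
        assert(node_);
        for (std::size_t k = 0; k <= node_->degree; ++k) {
            const long c = coefficient(*node_, k);
            if (c != 0) return c > 0 ? 1 : -1;
        }
        throw Invalid_perturbation();
    }
    std::size_t computed_coefficients() const { return node_->known.size(); }
};
```

With the values of Fig. 3, $x \cdot y - v \cdot w = (15 + 49\epsilon + 24\epsilon^2) - (24 + 46\epsilon + 7\epsilon^2) = -9 + 3\epsilon + 17\epsilon^2$, so the sign is negative and is already decided by the constant coefficient; the linear and quadratic coefficients are never computed.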

The random perturbation coordinate is randomly chosen from the interval 1
to MAXINT. The bitlength of the perturbed coordinate may be chosen smaller
than the bitlength of MAXINT⁵, as is for example necessary when we use
doubles with limited bitlength to assure exact computation. In such a case, the
perturbation has to be limited similarly.
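Limiting the bitlength of the perturbation can be as simple as drawing from a smaller interval. A minimal sketch, assuming a standard 64-bit generator; the function name and the `bits` parameter are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Draws a perturbation coefficient uniformly from [1, 2^bits - 1].
// `bits` is a hypothetical parameter that caps the bitlength of the result.
std::int64_t random_direction(std::mt19937_64& rng, unsigned bits) {
    assert(bits >= 1 && bits <= 62);
    const std::int64_t upper = (std::int64_t(1) << bits) - 1;
    std::uniform_int_distribution<std::int64_t> pick(1, upper);
    return pick(rng);
}
```

Note that by Theorem 3 a smaller range m raises the bound DS/m on the failure probability, so restricting the bitlength trades exactness of double arithmetic against more frequent restarts.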

A typical example setting is the following:

    typedef leda_integer NT;
    typedef CGAL_Point_2<CGAL_Homogeneous<NT> > POINT;
    typedef CGAL_Point_2<CGAL_Homogeneous<CGAL_Epsilon_polynomial<NT> > > ePOINT;

    leda_list<POINT> pL;              // list of input points pL is given
    leda_list<ePOINT> perturbed_pL;   // receives the perturbed copies

    POINT p;
    forall(p, pL)                     // LEDA iteration macro
        perturbed_pL.append(CGAL_perturb(p));

Now we can do the same things with perturbed points as with normal points:
we may compare them, test them for collinearity and so on. The outcome of these
test functions with perturbed points in general won’t be zero. If we encounter a
zero outcome, then our random perturbation was bad and we have to start over.
However, there is another possibility for a zero outcome, e.g., if we make a local
copy of a perturbed input point and test both for equality⁶. Thus the reason for
an exception or warning that is automatically produced on encountering a zero test
has to be carefully studied.

⁴ Each recursively computed coefficient is stored in the coefficient vector and thus does
not have to be recomputed.

⁵ The bitlength of MAXINT is 32 in our experiments.

⁶ In Cgal and Leda, points and segments are handle types, i.e., there is a pointer to
their representation. Testing for identity of the pointers eliminates this undesired
zero outcome.
4 Experiments

To evaluate the practical use of our implementation of perturbations we will
discuss three examples: computing the convex hull of a planar point set,
computing the intersections of line segments and computing the Delaunay
triangulation of a planar point set. We take efficient, well designed algorithms of
Leda [15, 13]⁷ and Cgal for those problems, which work on all input instances
by treating degenerate cases. We will compare these algorithms with the
corresponding perturbed ones without the treatment of degeneracies in terms of
running time⁸ and simplification of the code. Since perturbations assume the
use of exact arithmetic, we compare the use of doubles with suitably restricted
bitlength and the type integer of Leda to represent the homogeneous
coordinates of points and segments.



4.1 Convex Hulls 

Given a set S of points in the plane, its convex hull is the smallest convex set
containing S. A natural representation is a minimal cyclic list of the vertices
on the hull. The possible degenerate cases are the identity of two points or the
collinearity of three points. We take two algorithms from Leda for this problem
(see [15, 13]), one based on sweep (a modification of Graham’s scan) and one
based on randomized incremental construction⁹ (RIC). The high-level Leda code for
the sweep is 28 lines long, of which 7 lines deal with degeneracies. The RIC is
slightly more involved and consists of 106 lines of code, 19 of them dealing with
degeneracies (see [15] and [10]). In our perturbed versions we can omit those
lines since we use a list of perturbed points. Planar convex hulls are an easy
example where it is not very difficult to handle degenerate cases. Hence there is
only a modest gain in terms of lines of code and structure of the code. Figure 4
shows the price we have to pay for using perturbed points¹⁰ instead of normal
points:

The coordinates of the points had been randomly chosen from [0, . . ., 10000] 
which results in few degeneracies. To enforce degeneracies we also performed 
experiments where the random coordinates had been rounded to a grid with 
gridwidth 100 and 500 respectively (the two dashed lines). 
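Rounding to a grid forces many points onto common horizontal and vertical lines, which produces the identical and collinear points the experiment is after. A minimal sketch of such a rounding step, assuming rounding to the nearest grid line (the text does not say whether coordinates were rounded to the nearest line or truncated) and a hypothetical function name:

```cpp
#include <cassert>

// Rounds a nonnegative coordinate to the nearest multiple of `width`
// (ties go upward). Rounding to nearest is an assumption of this sketch.
long round_to_grid(long x, long width) {
    assert(x >= 0 && width > 0);
    return ((x + width / 2) / width) * width;
}
```

With coordinates in [0, 10000], a gridwidth of 100 leaves 101 distinct values per coordinate and a gridwidth of 500 only 21, so the coarser grid yields the more degenerate input.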

In the first two diagrams we used doubles to represent the coordinates. For 
normal random input (solid lines) this results in a large overhead factor of around 

Cgal provides a large number of geometric algorithms but not yet an algorithm for
line segment intersection, so we CGALized the existing Leda code.

All experiments measure CPU time in seconds on a Sun Enterprise (333 MHz, 6 GB
RAM) running Solaris 2.6, using Leda 3.7.1 and Cgal 1.2 compiled with g++ -O2.

Cgal also offers a number of implementations of different convex hull algorithms [18]
where degeneracies are handled implicitly in most cases.

The input points are now polynomials of degree one, so operations on polynomials 
replace the previous operations on doubles and integers respectively. The maxi- 
mum degree in intermediate computations is 2 (in the left/right-turn and orientation 
primitives). 
[Figure 4: four diagrams plotting the overhead factor against the number of points
(500 to 5000), each with different degeneracy levels. Diagram 1: CHSWEEPp using
13-bit doubles. Diagram 2: CHRICp using 13-bit doubles. Diagram 3: CHSWEEPp using
52-bit integers. Diagram 4: CHRICp using 52-bit integers.]

Fig. 4. Overhead factor between the original and perturbed convex hull algorithms for
random points in [0, ..., 10000].
40 for the perturbed sweep (Diagram 1) and an even larger overhead factor of
around 65 for the perturbed RIC (Diagram 2). This huge performance loss arises
because the arithmetic part of the convex hull algorithms is very large, so
the computation with polynomials over doubles incurs a significant slowdown
compared to the extremely fast computation with doubles.

The overhead rises further if we take more degenerate inputs (dashed lines). This
stems from the fact that we now have many identical and collinear points that
may let the intermediate hull of the perturbed algorithms grow but not that of
the unperturbed ones. Moreover, the sign computations are more “difficult” now
for the perturbed case.
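
The cost of a perturbed predicate can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the Cgal implementation: Cartesian integer coordinates instead of homogeneous ones, and hypothetical function names. Each point p is replaced by p + ε·d for some direction d, so the orientation determinant becomes a polynomial of degree 2 in ε, and for sufficiently small ε > 0 its sign is the sign of the first non-zero coefficient. For degenerate input the constant coefficient vanishes and further coefficients have to be inspected.

```python
def orientation_coefficients(p, q, r, dp, dq, dr):
    """Coefficients (c0, c1, c2) of the orientation determinant of the
    perturbed points p + eps*dp, q + eps*dq, r + eps*dr, written as
    c0 + c1*eps + c2*eps**2. All inputs are pairs of integers."""
    ux, uy = q[0] - p[0], q[1] - p[1]          # q - p
    vx, vy = r[0] - p[0], r[1] - p[1]          # r - p
    dux, duy = dq[0] - dp[0], dq[1] - dp[1]    # perturbation of q - p
    dvx, dvy = dr[0] - dp[0], dr[1] - dp[1]    # perturbation of r - p
    c0 = ux * vy - uy * vx                     # unperturbed determinant
    c1 = ux * dvy + dux * vy - uy * dvx - duy * vx
    c2 = dux * dvy - duy * dvx
    return c0, c1, c2


def perturbed_orientation(p, q, r, dp, dq, dr):
    """Sign of the determinant for sufficiently small eps > 0:
    the sign of the first non-zero coefficient (0 only if all vanish)."""
    for c in orientation_coefficients(p, q, r, dp, dq, dr):
        if c != 0:
            return 1 if c > 0 else -1
    return 0
```

For three collinear points such as (0,0), (1,1), (2,2) the constant coefficient is zero, and the result is decided by the coefficient of ε, which depends only on the chosen directions.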

Diagrams 3 and 4 show the same experiments where the point coordinates 
had been “pumped up” to 52 bits. We now have to use exact integers for the 
coordinates to obtain the correct result. 

For normal random inputs (solid lines), the overhead of the perturbed sweep
(Diagram 3) is reduced to a constant factor of 2.8, whereas the perturbed RIC
(Diagram 4) has a constant overhead factor of around 2.2. This can be
explained as follows: the use of exact integers increases the running time of the
arithmetic part in both the perturbed and the unperturbed versions. Since the
random perturbation direction has bitlength at most 32, not all operations are
between 52-bit numbers in the perturbed versions, so the overhead of polynomial
arithmetic is reduced.

If we look at highly degenerate inputs (dashed lines), we again observe higher
overheads of up to 7 for the sweep and around 2.6 for the RIC. In the RIC the
difference between random and highly degenerate input is not as pronounced as
in the sweep since the points are taken into consideration in random order.

Postprocessing to recover the original nonperturbed solution takes negligible
time, since the number of points on the hull is expected to be logarithmic in the
total number of points in this case (only for points on a circle does it take about
as long as preprocessing, i.e. perturbing the points), but the savings in terms of
lines of code are lost.

4.2 Line Segment Intersection 

Now we turn to another basic problem in computational geometry, the line segment
intersection problem. We investigate the performance of the Leda implementation
of the Bentley-Ottmann sweep that can cope with degeneracies and
takes O((n + s) log n) time (where s is the number of intersections). In contrast
to convex hulls, there are a large number of possible degenerate cases: multiple
intersections, overlapping segments, segments sharing endpoints, segments
degenerated to a single point, vertical segments, and endpoints or intersection
points with the same x-coordinate.

Before adopting the lazy evaluation approach we simply performed the “complete”
operation when computing with ε-polynomials. The lazy evaluation approach
reduced the overhead by around 80% for random inputs but only slightly for highly
degenerate inputs.

The latter two are often called algorithm-induced degeneracies.
A perturbed version of the algorithm is easily obtained using a list of perturbed
segments as input. Since the algorithm is more complicated, it is not
straightforward to identify the treatment of degenerate cases that we may
remove. Therefore we implemented our own perturbed version of the sweep, which
saves 71 lines of 201 and is easier to understand since we followed the original
formulation (see [15] and [10]). Hence, the savings in terms of lines of code and
development time are quite significant.

What penalty do we have to pay for this? The maximum degree of the
ε-polynomials is 5 in the sweep (intersection computation, orientation test of an
input segment with an endpoint or intersection point, and using homogeneous
coordinates). Figure 5 shows the results of our experiments for segments with random
coordinates in [0, ..., 1000]; grid rounding (with gridwidth 20 and 50
respectively) has again been used to obtain varying levels of degeneracy.

In the first diagram we used doubles to represent the coordinates. For normal
random input (solid line) we observe an overhead factor of about 9.5. The
overhead using doubles is much smaller than in the convex hull algorithms since
the arithmetic part of the sweep is not so dominant.

To assure exact computation we had to limit the perturbation direction to
10 bits. This sometimes resulted in zero signs despite perturbation, so we had
to restart the computation. Beyond a certain input size it was no longer possible
to come up with a valid perturbation scheme for highly degenerate
inputs.

In Diagram 2 we pumped up the coordinates to 52 bits and used integers
to represent them. The overhead factor is smaller again:

For normal random input (solid line) the overhead factor is about 4.1. Using 
highly degenerate inputs (dashed lines) the overhead grows to about 4.5. This 
comes from the fact that in this case we have many overlapping segments and 
many multiple intersection points which are “better” for the output sensitive 
original sweep; moreover we again get more difficult sign computations in the 
perturbed case. 

Experiments with n segments all intersecting in one point showed us that the
perturbed sweep performs badly since it has to detect n(n - 1)/2 intersection
points, whereas the original output-sensitive version takes O(n log n) and is faster
than usual [10]. This drawback of the perturbation approach was already pointed
out in [5].

Postprocessing for the perturbed segment sweep is nontrivial since the perturbation
may cause an intersection to vanish. We implemented a “postprocessing”
step that returns the (unperturbed) list of intersection points. This requires
testing whether endpoints of segments lie within another segment during the
sweep, and real postprocessing (unperturbing the intersection points and
eliminating duplicates) after the sweep. Apart from additional coding, the running
time overhead factor increases slightly for random input (around 15%) and more
drastically for highly degenerate inputs (up to 70%), where the difference in the
number of intersections between the perturbed and the original version is very high.
[Figure 5: four diagrams plotting the overhead factor. Diagram 1: SEGINTp using
10-bit doubles (x-axis: number of segments). Diagram 2: SEGINTp with different
degeneracy levels using 52-bit integers (x-axis: number of segments). Diagram 3:
title not recoverable (x-axis: number of points, 200 to 2000). Diagram 4: DEL_TRp
using 52-bit integers (x-axis: number of points, 200 to 2000).]

Fig. 5. Overhead factor between the original and perturbed sweep algorithm for
random segments in [0, ..., 1000] and between the original and perturbed Delaunay
Triangulation algorithm for random points in [0, ..., 1000].
If we are interested in the intersection graph, postprocessing is even more
challenging [3].



4.3 Delaunay Triangulation 

The Delaunay triangulation of a planar point set is a special triangulation max- 
imizing the minimum angle of the triangles. A triangle of the Delaunay trian- 
gulation does not contain another point in its circumcircle, hence if we want to 
compute the Delaunay triangulation we need the in-circle test for 4 points which 
is a degree 4 predicate. 
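
As an illustration of why the in-circle test has degree 4, here is a minimal sketch of the standard determinant formulation. It is not taken from the Cgal code; the function name is hypothetical, and Cartesian integer coordinates are assumed so that Python integers give an exact result. Each term of the determinant multiplies two coordinate differences with a sum of squared differences, hence degree 4 in the input coordinates.

```python
def in_circle(a, b, c, d):
    """Sign of the in-circle determinant for points given as integer pairs.
    If a, b, c are in counterclockwise order, the result is
    +1 if d lies inside the circle through a, b, c,
     0 if the four points are cocircular (the degenerate case),
    -1 if d lies outside."""
    adx, ady = a[0] - d[0], a[1] - d[1]
    bdx, bdy = b[0] - d[0], b[1] - d[1]
    cdx, cdy = c[0] - d[0], c[1] - d[1]
    ad2 = adx * adx + ady * ady   # squared distance from a to d
    bd2 = bdx * bdx + bdy * bdy
    cd2 = cdx * cdx + cdy * cdy
    # 3x3 determinant expanded along the first row
    det = (adx * (bdy * cd2 - cdy * bd2)
           - ady * (bdx * cd2 - cdx * bd2)
           + ad2 * (bdx * cdy - cdx * bdy))
    return (det > 0) - (det < 0)
```

With 52-bit input coordinates the intermediate products need more than 200 bits, which is why an exact number type is required in that setting.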

In our experiments we used the Cgal implementation of Delaunay triangulations,
which is based on flipping. Whereas the case of cocircular points is not
really a problem here, a careful treatment of degenerate cases regarding collinear
and identical points was necessary for the “march-locate” step [24]. We did not
implement a perturbed version of the flipping algorithm but believe that at least
10% of the code can be saved.

It can be seen in Diagram 3 of Fig. 5 that the overhead factor is around 65
for doubles. The reason for the huge overhead is that the arithmetic part of
the Delaunay triangulation algorithm is very large.

Using pumped up exact integers the overhead factor is reduced to about 
2.5 for random input. Highly degenerate inputs have a larger overhead factor of 
up to 4 since the sign computations are more difficult. 

Postprocessing can be done efficiently here by shrinking degenerate triangles 
resulting from identical points and removing degenerate triangles on the hull 
resulting from collinear points. 

A Note on the Experimental Setting. The huge overhead factor for doubles may
seem to indicate that our perturbation implementation is not practical. However,
going over from doubles to an exact number type like integers without filtering
results in similar overheads [19].

Input coordinates of 52 bits may also seem artificial at first glance but there 
are algorithms where this can occur, e.g. the crust algorithm [2] for curve re- 
construction first computes the Voronoi diagram of the sample points and then 
a Delaunay triangulation of the Voronoi vertices and the sample points. Hence 
even when starting with 10 bit input points, the input for the Delaunay trian- 
gulation may be 40 bits long. 

The overhead factor for perturbation using integers with bitlength 20 (which
is realistic and requires exact computation when using predicates of degree > 3)
is only 20-40% larger than in our 52-bit experiments (see [10]), hence we can really
speak of a medium overhead factor when using our perturbation implementation
with a number type providing exact computation.

The bitlength of the perturbation direction has to be limited to 13 to assure exact 
computation. This resulted in a few zero signs for highly degenerate inputs. 
Experiments with reals [4] show similar results [10]. 
5 Summary and Conclusion

We presented a generic and easy to use implementation of the linear perturbation
approach of [21] within Cgal. Using our implementation, it is possible to
perturb the input, which abolishes the degenerate cases.

As we have seen in the three applications, this introduces a medium overhead
factor for the running time, which depends on the runtime fraction of the
arithmetic part of an algorithm as well as on the number type used. The overhead
on highly degenerate inputs is even larger.

We face additional problems if we want to have the original result and not the
perturbed one, as is the case when our (perturbed) implementation is part
of a larger system and interacts with other components. The necessary
postprocessing step often requires work comparable to treating the
degenerate cases in the first place.

We conclude that the perturbation approach using our ε-polynomial implementation
is an important tool for rapid prototyping of geometric algorithms. It
enables us to implement difficult algorithms in quite reasonable time if we don’t
care about a medium runtime penalty. We hope that this will help
more of the theoretical work in computational geometry find its way into
practice. However, if we have the original non-perturbed output in mind or if we
cannot afford overhead and don’t care about the time for the implementation
process, it will be necessary to code a stable nonperturbed version.

More details of the implementation and the experiments can be found in [9,
10]. The ε-polynomial code including documentation and demo programs can
be downloaded from http://www.mpi-sb.mpg.de/~mark/perturbation. It will be
included in a future release of Cgal.
Abstract. We study cache effects in distribution sorting algorithms. We
note that the performance of a recently published distribution sorting
algorithm, Flashsort1, which sorts n uniformly distributed floating-point
values in O(n) expected time, does not scale well with the input size n
due to poor cache utilisation. We present a two-pass variant of this
algorithm which outperforms the one-pass variant and comparison-based
algorithms for moderate to large values of n. We present a cache analysis
of these algorithms which predicts their cache miss rate
quite well. We also show that the integer sorting algorithm MSB
radix sort can be used very effectively on floating-point data. The
algorithm is very fast due to fast integer operations and relatively good
cache utilisation.



1 Introduction 

Most algorithms are analysed on the random-access machine (RAM) model of
computation, using some variety of unit-cost criterion. In particular, the
RAM model postulates that accessing a location in memory costs the same as a
built-in arithmetic operation, such as adding two word-sized operands. However,
over the last 20 years or so CPU clock rates have grown explosively, with an
average annual rate of increase of 35-55%. As a result, nowadays even
entry-level machines come with CPUs with clock frequencies of 400 MHz or above.
Unfortunately, the speeds of main memory have not increased as rapidly, and
today’s main memory typically has a latency of about 70ns. Hence, a conservative
estimate is that a memory access can take 30+ CPU clock cycles.

In order to overcome this difference in speeds, modern computers have a 
memory hierarchy which inserts multiple levels of cache between CPU and main 
memory. A cache is a fast associative memory which holds the values of some 
main memory locations. If the CPU requests the contents of a memory location, 
and the value of that location is held in some level of cache, the CPU’s request is 
answered by the cache itself (a cache hit); otherwise it is answered by consulting
the main memory (a cache miss). A cache hit has small or no penalty (1-3 cycles 
is fairly typical) but a cache miss is very expensive. 

* Supported in part by EPSRC grant GR/L92150 
Nowadays a typical memory hierarchy has CPU registers (the highest level
of the hierarchy), L1 cache, L2 cache and main memory (the lowest level). The
number of registers and the size of caches are limited by several factors including
cost and speed. Normally, the L1 cache holds more data than CPU registers,
and L2 cache much more than L1 cache. Even so, L2 cache capacities are typically
512KB to 2MB, which is considerably smaller than the size of main memory.

It should be added that the memory hierarchy continues beyond main memory
to disk storage. As the cost of servicing ‘cache misses’ in this context is
not included in the running times, we are not primarily concerned with these
levels of the hierarchy. However, most CPUs provide hardware support for
managing these levels of the hierarchy in the form of a translation look-aside buffer
(TLB). The TLB can affect running times, as a cache miss may require two
memory accesses to be made. Hence a cache miss can easily cost 60+ cycles.

Programs have much faster running times if they have a high cache hit rate — 
i.e. if most of their memory references result in a cache hit. By tuning an al- 
gorithm’s cache performance, one can obtain substantial improvements in the 
running time of its implementations, as we demonstrate in the context of distri- 
bution sorting algorithms. 

Distribution sorting is a popular technique for sorting data which is assumed
to be randomly distributed, and involves distributing the n input keys into m
classes based on their value. The classes are chosen so that all the keys in the ith
class are smaller than all the keys in the (i + 1)st class, for i = 0, ..., m - 2, and
furthermore, the class to which a key belongs can be computed in O(1) time. In
O(n) time, the problem is reduced to sorting the m classes. Distribution sorting
can be used to sort randomly distributed keys in O(n) time on average
[Ch. 5.2, 5.2.1], which is asymptotically faster than the O(n log n) running time of
comparison-based approaches. Neubert presented an implementation of a
distribution sorting algorithm Flashsort1, which used a combination of a well-known
counting method and an in-place permutation method similar to one described in
[Soln. 5.2-13]. With Neubert’s choice of parameters Flashsort1 uses only n/10
words of memory in addition to the memory required by the data, and sorts
n uniformly (and independently) distributed floating-point keys in O(n) time.
Neubert’s experiments showed that his implementation of Flashsort1 was twice
as fast as Quicksort when sorting about 10,000 keys. In the RAM model, the
lower asymptotic growth rate of Flashsort1 and the fact that Flashsort1
outperforms Quicksort for n = 10,000 would indicate that Flashsort1 would continue
to outperform Quicksort for larger values of n. Unfortunately, this is not the
case. We translated Neubert’s Fortran code for Flashsort1 into C and
performed extensive experiments which clearly indicated that although Flashsort1
was significantly faster than Quicksort for n in the range 4K to 128K, Quicksort
caught up with and surpassed Flashsort1 at 1M keys. Note that Quicksort has
O(n log n) running time even on average, whereas Flashsort1 is a linear-time

K = 1024 and M = 1024K in this paper, except that M in MHz equals 10^6.

This happens in case of a TLB miss, i.e. the TLB does not hold the page table entry
for the page which was accessed.
algorithm. In fact, the ratio of the running times of Flashsort1 to Quicksort
continued to grow with n, up to n = 64M.

Our analyses and simulations verify that this is due to the poor cache
performance of Flashsort1. By adapting Flashsort1 to run in two passes we get an
algorithm Flashsort2P: because Flashsort2P makes better use of the cache, it
out-performs both Flashsort1 for n > 1M items, and Quicksort for n > 4M items.
We present a cache analysis of both Flashsort variants, and validate the
formula by simulations. We also show that by using the MSB (most-significant-bit
first) radix sort on the integer representation of floating point numbers we
can clearly out-perform Quicksort, Flashsort1 and Flashsort2P for all values
of n that we tested. We note that understanding the cache behaviour for the
Flashsorts does not directly say anything about MSB radix sort. However, our
simulations show that it maintains a low miss rate, and hence is able to benefit
from the extra speed of integer operations.

2 Preliminaries 

This section introduces some terminology and notation regarding caches. The 
size of the cache is normally expressed in terms of two parameters, the block size 
(B) and the number of cache blocks (C). We consider main memory as being 
divided into equal-sized blocks consisting of B consecutively-numbered memory 
locations, with blocks starting at locations which are multiples of B. The cache 
is also divided into blocks of size B; one cache block can hold the value of exactly
one memory block. Data is moved to and from main memory only as blocks. 

In a direct-mapped cache, the value of memory location x can only be stored 
in cache block c = (x div B) mod C. If the CPU accesses location x and cache
block c holds the values from x’s block the access is a cache hit; otherwise it is 
a cache miss and the contents of the block containing x are copied into cache 
block c, evicting the current contents of cache block c. For our purposes, cache 
misses can be classified into compulsory misses, which occur when a memory 
block is accessed for the first time, and conflict misses, which happen when a 
block is evicted from cache because another memory block that mapped to the 
same cache block was accessed. 
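
The mapping rule and the two kinds of misses can be made concrete with a small simulation. This is a minimal sketch, not the simulator used for the experiments; the function name and the return format are assumptions made for illustration. Locations are counted in items, so B is the number of items per block and C the number of cache blocks.

```python
def simulate_direct_mapped(addresses, B, C):
    """Replay a sequence of memory locations on a direct-mapped cache.
    Returns (hits, compulsory_misses, conflict_misses)."""
    cache = [None] * C        # cache[c] = memory block currently held in slot c
    seen = set()              # memory blocks that have been accessed before
    hits = compulsory = conflict = 0
    for x in addresses:
        block = x // B        # memory block containing location x
        c = block % C         # the only cache block that may hold it
        if cache[c] == block:
            hits += 1
        else:
            if block in seen:
                conflict += 1     # block was evicted by another block mapping to c
            else:
                compulsory += 1   # first access to this memory block
                seen.add(block)
            cache[c] = block      # load the block, evicting the old contents
    return hits, compulsory, conflict
```

For example, with B = 4 and C = 2 the access sequence 0, 1, 2, 3, 8, 0 gives three hits, two compulsory misses (blocks 0 and 2) and one conflict miss, because block 2 maps to the same cache block as block 0 and evicts it.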

An important consideration is what happens when a value is written to a 
location stored in cache. If the cache is write-through the value is simultaneously 
updated in the cache and in the next lower level of the memory hierarchy; if 
the cache is write-back, the change is recorded only in cache. Of course, when 
a block is evicted from a write-back cache it must be copied to the next lower 
level of the hierarchy. 

We performed our experiments on a Sun UltraSparc-II, which has a block size
of 64 bytes, an L1 cache size of 16KB and an L2 cache size of 512KB. Both L1 and
L2 caches are direct-mapped, the L1 cache is write-through and the L2 cache
is write-back. As our programs hardly ever read a memory location without
immediately modifying it, the L1 cache is ineffective for our programs and we
focus on the L2 cache. Hence, for our machine’s L2 cache we have C = 8192. This
paper deals mainly with single-precision floating-point numbers and integers,
both of which are 4 bytes long on this system. It is useful to express B in terms
of the number of ‘items’ (integers or floats) which fit in a block; hence we use
B = 16 in what follows.

3 Overview of Algorithms 

We now describe the main components of the two Flashsort variants, and explain
why Flashsort1 may have poor cache performance. We then introduce
Flashsort2P and MSB radix sort. While describing all these algorithms, the term data
array refers to the array holding the input keys, and the term count array refers
to an auxiliary array used by these algorithms.



3.1 Flashsort1

Flashsort1 has three main phases, the first two of which are a count phase and
a permute phase. After the count and permute phases, the data array should
have been permuted so that all elements of class k lie consecutively before all
elements of class k + 1, for k = 0, ..., m - 2. The data array is then sorted using
insertion sort. Fig. 1 gives pseudo-code for the count and permute phases; it is
assumed that the value of m has been set appropriately before the count phase, and
that the function classify maps a key to a class numbered {0, ..., m - 1} in
O(1) time. For example, if the values are uniformly distributed over [0, 1) then
classify(x) can return ⌊m · x⌋.



(a) A count phase

1  for i := 0 to n - 1 do
       COUNT[classify(DATA[i])]++;
2  COUNT[m - 1] := n - COUNT[m - 1];
   for i := m - 2 downto 0 do
       COUNT[i] := COUNT[i + 1] - COUNT[i];

(b) A permute phase

1  start := 0; nmoved := 0;
2  idx := start; x := DATA[idx];
3  c := classify(x); idx := COUNT[c];
   swap x and DATA[idx];
   nmoved := nmoved + 1; COUNT[c]++;
   if idx ≠ start go to 3;
4  Find start of next cycle and set
   start to this value. Go to 2.



Fig. 1. The two phases in both Flashsort1 and Flashsort2P. DATA holds the input
keys and COUNT is an auxiliary array, initialised to all zeros. The permute phase
terminates whenever in the above: (i) nmoved > n - 1 or (ii) start moves beyond
the end of the array.



Fix some class k and let t be the value of COUNT[k] at the start of the permute
phase. During the course of the permute phase, an invariant is that locations
t, ..., COUNT[k] - 1 contain elements of class k, i.e. COUNT[k] points to
the ‘next available’ location for an element of class k. The permute phase moves
elements to their final locations along a cycle in the permutation; a little thought
is needed to find cycle leaders correctly.
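
The count and permute phases can be sketched as runnable code. This is a minimal illustration, not Neubert's implementation or the C code used in the experiments. Two points are assumptions of the sketch: the function names are hypothetical, and cycle leaders are found with an extra array `ends` of class boundaries, which costs m additional words but makes the leader search trivial (the first unfilled location of the first unfinished class).

```python
def classify(x, m):
    """Class of a key x in [0, 1): floor(m * x), clamped to m - 1 as a
    guard against floating-point rounding."""
    return min(int(m * x), m - 1)


def flashsort1_sketch(data, m):
    """Sort a list of keys in [0, 1) in place: count, permute, insertion sort."""
    n = len(data)
    # Count phase: class sizes, then COUNT[k] = first location of class k.
    count = [0] * m
    for x in data:
        count[classify(x, m)] += 1
    count[m - 1] = n - count[m - 1]
    for i in range(m - 2, -1, -1):
        count[i] = count[i + 1] - count[i]
    ends = count[1:] + [n]   # assumption of this sketch: explicit class ends

    # Permute phase: count[k] is the 'next available' location of class k.
    for k in range(m):
        while count[k] < ends[k]:
            start = count[k]          # cycle leader: first unfilled location
            x = data[start]
            while True:
                c = classify(x, m)
                idx = count[c]
                x, data[idx] = data[idx], x   # place x, pick up the old key
                count[c] += 1
                if idx == start:              # the cycle is closed
                    break

    # Final phase: insertion sort over the whole, now nearly sorted, array.
    for i in range(1, n):
        key = data[i]
        j = i - 1
        while j >= 0 and data[j] > key:
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = key
    return data
```

A call such as `flashsort1_sketch(keys, max(1, len(keys) // 10))` corresponds to the choice m = n/10; the access pattern `count[classify(x, m)]` is the one whose cache behaviour is discussed in the text.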

If the number of classes is m < n the total expected cost of the insertion sort
is easily seen to be O(n²/m), while the rest of the algorithm evidently runs in
O(n) time. The algorithm uses m extra memory locations. After experimentation
Neubert chose m = n/10 to minimise extra memory while maintaining a
near-minimum expected running time. However, for large values of n, it is apparent
that the cache performance of Flashsort1 will be poor: e.g., for n = 64M the
count array will be approximately 50 times the size of cache, which means that
all accesses to the count array will be cache misses.
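
The factor of about 50 follows from the parameters given in the text. The short calculation below assumes 4-byte count-array entries (the item size stated in Section 2) and the 512KB L2 cache of the test machine; the variable names are illustrative only.

```python
K = 1024
M = 1024 * K

n = 64 * M                 # number of keys
m = n // 10                # number of classes chosen by Neubert
count_bytes = 4 * m        # count array size, assuming 4-byte entries
cache_bytes = 512 * K      # L2 cache size of the test machine

ratio = count_bytes / cache_bytes   # about 51.2
```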

3.2 Flashsort2P 

It appears from the preceding discussion that we should reduce m in order to get
an algorithm with better cache performance. In Flashsort2P, the distribution is
done in two phases. In the first, we use m = √n/2, giving classes with about
2√n keys on average. We apply distribution sort again to each sub-problem, this
time with m = √n classes. At the end, we have classes with an expected size of
2 keys, and we sort these using insertion sort. Note that the expected running
time of Flashsort2P is O(n). The potential benefits of this approach are:

• A smaller number of classes in the first phase may lead to a lower number of
misses;

• The problems in the second phase will be of size 2√n; for n < 2^32 the
expected size of a problem in the second phase will be smaller than 512KB,
which is our L2 cache size. Hence, each of the sub-problems in the second phase
should fit entirely into cache, giving few misses in the second phase. We can also
now perform the insertion sort for a sub-problem as soon as we finish with the
count/permute step for this sub-problem, avoiding the compulsory misses for
the global insertion sort of Flashsort1.

• Flashsort2P has much lower auxiliary space requirements than Flashsort1,
since the count array is of size O(√n), rather than Θ(n) for Flashsort1.

• The insertion sort problems are smaller in Flashsort2P. However, the insertion
sort is only a small fraction of the total running time so this improvement should
not yield great benefits.
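
The parameter choices above can be checked with a few lines of arithmetic. This is a minimal sketch with a hypothetical helper name, assuming 4-byte keys and rounding √n down to an integer.

```python
import math


def flashsort2p_parameters(n, key_bytes=4):
    """Return (first-phase classes, expected keys per sub-problem,
    expected bytes per sub-problem) for an input of n keys."""
    root = math.isqrt(n)                    # floor of sqrt(n)
    first_phase_classes = max(1, root // 2) # m = sqrt(n) / 2
    subproblem_keys = 2 * root              # about 2 * sqrt(n) keys per class
    subproblem_bytes = subproblem_keys * key_bytes
    return first_phase_classes, subproblem_keys, subproblem_bytes
```

For n = 2^26 (the largest input used in the experiments) this gives 4096 first-phase classes and sub-problems of about 16384 keys, i.e. 64KB, well below the 512KB L2 cache.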



3.3 MSB Radix Sort 

Most significant bit (MSB) radix sort treats the keys as integers by looking
at the bit-string representation of the floating point numbers. It is well-known
that if the floating point numbers are represented according to the IEEE 754
standard, then the bit strings that represent two floating point numbers have
the same ordering as the numbers themselves, at least if both the numbers are
non-negative. In our implementation of MSB radix sort we first distribute
all numbers according to their most significant r bits, where r = min{⌈log n⌉ -
2, 16}, where n is the number of keys to be sorted. The permute phase is similar
to the Flashsort variants. A sub-problem consisting of n' > 16 keys is then
attacked in the analogous manner, i.e. we distribute the n' keys based on their
next most significant r' bits, where r' = min{⌈log n'⌉ - 2, 16}. Problems of size
≤ 16 are solved with insertion sort.

For lack of space we do not describe in detail the major differences between
MSB radix sort and the Flashsort variants. The main point to be noted is that a
uniform distribution on floating point numbers in the range [0, 1) does not induce
a uniform distribution on the representing integers, and we cannot analyse it
the same way as the Flashsort variants. For example, half the numbers will have
value in [0.5, 1); after normalisation they all share the same exponent. This means
that in the IEEE 754 standard representation of single-precision floating-point
numbers, half the keys will have the pattern 001111110 (sign bit 0 followed by the
biased exponent 126) in their most significant 9 bits. However, the algorithm can
be shown to be linear-time for random w-bit floating point numbers, if some
assumptions are made about the size of the exponent.
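
Both observations, order preservation and the shared leading bits, can be checked directly. The sketch below uses Python's standard `struct` module to obtain the IEEE 754 single-precision bit pattern; the function names are hypothetical and the code is an illustration, not the paper's implementation.

```python
import struct


def float_key(x):
    """Bit pattern of x as an IEEE 754 single-precision number,
    interpreted as an unsigned 32-bit integer (big-endian)."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def top_bits(x, r):
    """The r most significant bits of the 32-bit pattern of x."""
    return float_key(x) >> (32 - r)
```

For non-negative values, `sorted(values, key=float_key)` yields the same order as `sorted(values)`, and `top_bits(x, 9)` returns the same value for every x in [0.5, 1), so the first distribution pass cannot separate these keys by their leading 9 bits.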

The above argument also means that after the first pass, there will be several
large sub-problems (e.g. with r = 16 there may be problems of expected size
about n/256) which may not fit easily into L2 cache for very large n. Hence MSB
radix sort does not have as good cache utilisation as Flashsort2P for large n.
However, we find that MSB radix sort can out-perform Flashsort1, Flashsort2P
and Quicksort over the range of values that we considered.



4 Experimental Results 

We have implemented Flashsort1, Flashsort2P and MSB radix sort, and have
used them to sort n uniformly distributed floating-point numbers, for n = 2^i,
i = 10, 11, ..., 26. We have also tested a highly tuned implementation of
recursive Quicksort. Our algorithms were coded in C (as was the Quicksort
implementation) and all code was compiled using gcc 2.8.1. For each
algorithm, we have measured actual running times as well as simulated numbers of
cache misses. The running times were measured on a Sun Ultra II with 2 x 300
MHz processors and 512MB main memory. As mentioned above, this machine
has a 16KB L1 data cache and a 512KB L2 cache, both of which are direct-mapped.
Our simulator simulates only the L2 cache on this machine, and only reports
cache hit/miss statistics. Each run time and simulation value reported in this
section is the average of 100 runs.

Figure 2 summarises the running times for Flashsort1, Flashsort2P,
Quicksort and MSB radix sort. For the smaller input sizes (n < 512K), the timing
was obtained by repeatedly copying and sorting a given (unsorted) sequence of
numbers and taking the average time. The running times reported include the
time for copying. We observe that:
• For small values of n Flashsort1 gets steadily faster than Quicksort until it
uses about 70-75% of the time for Quicksort for about 128K. After that the
performance advantage narrows until at n = 1M Quicksort overtakes Flashsort1.
The gap between Quicksort and Flashsort1 grows steadily until the largest
input value we considered. This is interesting given that Quicksort has a higher
asymptotic running time than Flashsort1.

• Flashsort2P is slow for small values of n but starts to out-perform Flashsort1
at n = 1M and Quicksort at n = 4M. At large values of n Flashsort2P is almost
twice as fast as Flashsort1.

• MSB radix sort out-performs the other algorithms for all values of n shown.



Timings (s) on UltraSparc-2, single precision keys

n       Flashsort1   Flashsort2P   Quicksort   MSB radix
1K      0.0004       0.0006        0.0004      0.0004
2K      0.0008       0.0011        0.0009      0.0008
4K      0.0016       0.0023        0.0020      0.0016
8K      0.0035       0.0049        0.0044      0.0033
16K     0.0073       0.0099        0.0095      0.0069
32K     0.0150       0.0199        0.0203      0.0137
64K     0.0313       0.0402        0.0429      0.0276
128K    0.0687       0.0818        0.0916      0.0611
256K    0.1626       0.1896        0.1925      0.1381
512K    0.3840       0.4077        0.3930      0.3022
1M      1.0121       0.9419        0.8516      0.6477
2M      2.4634       1.8245        1.8048      1.3262
4M      5.5477       3.7342        3.8523      2.7178
8M      12.630       7.6996        8.1271      5.5562
16M     27.335       15.641        17.123      11.490
32M     57.912       32.714        36.503      25.166
64M     131.01       66.322        77.206      53.493



Fig. 2. Experimental evaluation of Flashsortl, Flashsort2P, Quicksort and MSB 
radix sort on a Sun Ultra II using single precision floating point keys. 
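The observations listed before Fig. 2 can be checked directly against the timings. The following is a small sketch, not part of the original experiments: the helper names `ratio` and `overtakes_from` are introduced here for illustration, and "starts to out-perform" is interpreted as "is strictly faster at this size and at every larger size in the table".

```python
# Timings in seconds, copied from the table of Fig. 2.
# Column order: Flashsort1, Flashsort2P, Quicksort, MSB radix.
TIMINGS = {
    "1K":   (0.0004, 0.0006, 0.0004, 0.0004),
    "2K":   (0.0008, 0.0011, 0.0009, 0.0008),
    "4K":   (0.0016, 0.0023, 0.0020, 0.0016),
    "8K":   (0.0035, 0.0049, 0.0044, 0.0033),
    "16K":  (0.0073, 0.0099, 0.0095, 0.0069),
    "32K":  (0.0150, 0.0199, 0.0203, 0.0137),
    "64K":  (0.0313, 0.0402, 0.0429, 0.0276),
    "128K": (0.0687, 0.0818, 0.0916, 0.0611),
    "256K": (0.1626, 0.1896, 0.1925, 0.1381),
    "512K": (0.3840, 0.4077, 0.3930, 0.3022),
    "1M":   (1.0121, 0.9419, 0.8516, 0.6477),
    "2M":   (2.4634, 1.8245, 1.8048, 1.3262),
    "4M":   (5.5477, 3.7342, 3.8523, 2.7178),
    "8M":   (12.630, 7.6996, 8.1271, 5.5562),
    "16M":  (27.335, 15.641, 17.123, 11.490),
    "32M":  (57.912, 32.714, 36.503, 25.166),
    "64M":  (131.01, 66.322, 77.206, 53.493),
}
COLUMN = {"Flashsort1": 0, "Flashsort2P": 1, "Quicksort": 2, "MSB radix": 3}
SIZES = list(TIMINGS)  # insertion order is the table order


def ratio(size, a, b):
    """Running time of algorithm a divided by that of algorithm b at one size."""
    row = TIMINGS[size]
    return row[COLUMN[a]] / row[COLUMN[b]]


def overtakes_from(a, b):
    """Smallest size from which a is strictly faster than b for all larger sizes."""
    start = None
    for size in reversed(SIZES):
        if ratio(size, a, b) < 1.0:
            start = size
        else:
            break
    return start


if __name__ == "__main__":
    print(ratio("128K", "Flashsort1", "Quicksort"))
    print(overtakes_from("Quicksort", "Flashsort1"))
    print(overtakes_from("Flashsort2P", "Flashsort1"))
    print(overtakes_from("Flashsort2P", "Quicksort"))
```

Under this interpretation the table agrees with the text: Flashsort1 needs about 75% of Quicksort's time at 128K, and the crossover points are 1M, 1M and 4M respectively.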



4.1 Cache Simulations 

We also ran these algorithms on an L2 cache simulator for the Sun Ultra II.
Figure 3 compares Quicksort and Flashsort1, while Fig. 4 compares the three
faster algorithms. These figures show the number of L2 cache misses per key for
the four algorithms on single precision floating point keys. We observe that:

• when the problem is small and fits in L2 cache, for n < 64K, the number of
misses per key is almost constant for each algorithm; these are the compulsory
misses.

• in Flashsort1 for n > 128K we see a rapid increase in the number of misses
per key as n grows, appearing to level off at over 3 misses per key (as we will
see, virtually every access that could be a miss is a miss). Clearly the misses
per key are much lower for Flashsort2P, Quicksort and MSB radix sort than for
Flashsort1.

• in Quicksort the number of cache misses per key increases very gradually as
the problem size increases, reaching about 0.75 misses per key at 64M (see
footnote 3).

• in Flashsort2P we see a very gradual increase in the misses per key, reaching
0.75 misses per key at 64M. However, the increase for Flashsort2P is not smooth.

• in MSB radix sort the number of cache misses per key increases very gradually,
reaching a maximum of 1.4. Other than for small n, the cache utilisation for MSB
radix sort is much better than for Flashsort1.




Fig. 3. L2 cache misses per key on single precision floating point keys: Flashsort1
vs Quicksort. Note that the x-axis is on a log scale.



5 Cache Analysis of Flashsort 

In this section we analyse the cache misses made by the Flashsort variants. We
will assume that the input size is an integral multiple of BC (in particular this
means that we only worry about n ≥ 128K) and that important arrays begin
at locations which are multiples of B. We also ignore rounding as the effect is
insubstantial. For the sake of simplicity we will assume that the various phases
are independent, i.e. the cache is emptied after each phase. This assumption
causes inaccuracies for small input sizes, where a significant part of the input
may stay in cache between say the count and permute phases, so we report
only the predicted values for n ≥ 256K. The analyses make extensive use of the
idea that if there are k memory blocks b_1, ..., b_k mapped to a cache block, and
in the ‘steady state’ memory block b_i is accessed with probability p_i, then for
j = 1, ..., k the probability that memory block b_j is currently in the cache is

$$\frac{p_j}{\sum_{i=1}^{k} p_i}.$$

Footnote 3. For Quicksort the number of misses per key will continue to grow very
gradually as n grows, as it performs O(log n) misses per key, but with a very
small constant.

Fig. 4. L2 cache misses per key on single precision floating point keys: Flashsort2P
vs Quicksort vs MSB radix sort. Note that the x-axis is on a log scale.



5.1 The Count Phase

Step 1 of the count phase consists of n sequential accesses to the data array and
n accesses to random count array locations (see footnote 4). Step 2 is a sequential
traversal through the count array. In addition to the compulsory misses, there may
be conflict misses in the first step; these can be analysed by a direct application
of the results of [?], and we do not go into greater detail here.

Footnote 4. Here and later we only count memory accesses which may result in L2
cache misses. For instance the statement COUNT[i] := COUNT[i] + 1 involves two
memory accesses, one read and one write. However, the write cannot be a cache
miss, so we ignore it.



5.2 The Insertion Sort Phase

Given that the problems which need to be solved are of expected size < 10 in 
each of the variants, it is extremely unlikely that any problem will exceed the 
size of the cache. Hence the insertion sort only incurs compulsory misses, and in 
the case of Flashsort2P, even these are avoided (unless n is very large). 



5.3 The Overall Analysis 

In the next section we present an analysis of the permute phase of the Flashsort
algorithms. The underlying assumptions and parameter choices mean that this
explains the cache misses in Flashsort1 as well as the first permute phase of
Flashsort2P. What remains is the analysis of the second phase of Flashsort2P,
which is presented in the full paper. As expected, because the size of the problems
in the second phase is small, the misses in this phase are mostly compulsory
misses. However, as n gets large, we start getting small but noticeable numbers
of conflict misses.



5.4 The Permute Phase 

The permute phase can be viewed as alternating cycle-following with finding
cycle leaders. During the cycle-following phase we make 2n memory accesses
which may lead to cache misses. These consist of alternating a random access to
the count array (to determine the class of x) with an access to one of m ‘active’
locations in the data array (the pointers to the ‘next available’ locations for
each class) in order to place x. In addition to the compulsory misses, we have the
following potential conflicts which may occur in the cycle-following phase.

There are m/B active count blocks, and at most m active data blocks (the
number of active data blocks may be much smaller than m if the class sizes
are small). Whenever more than one active block is mapped to the same cache
block, we potentially have conflict misses. In order to make precise the mapping
of active blocks to cache blocks, we consider three cases; these cases cover the
possible parameter choices in our two Flashsort variants.
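The case analysis relies on the steady-state rule stated at the start of Section 5: when several memory blocks compete for one cache block, block b_j is resident with probability p_j divided by the sum of the p_i. The following sketch checks that rule empirically for a single direct-mapped cache block. It is an illustration only: the function names, the weights and the use of independent random accesses are assumptions made here, not part of the paper's simulator.

```python
import random


def steady_state_probabilities(weights):
    """Predicted probability that each block is resident: p_j / sum_i p_i."""
    total = sum(weights)
    return [w / total for w in weights]


def simulate_hit_rates(weights, accesses=200_000, seed=1):
    """Simulate one direct-mapped cache block shared by len(weights) memory blocks.

    Each access picks a memory block independently with probability proportional
    to its weight. An access is a hit when the chosen block is already resident.
    Returns the observed hit rate per memory block.
    """
    rng = random.Random(seed)
    sequence = rng.choices(range(len(weights)), weights=weights, k=accesses)
    hits = [0] * len(weights)
    touches = [0] * len(weights)
    resident = None
    for block in sequence:
        touches[block] += 1
        if block == resident:
            hits[block] += 1
        resident = block  # the accessed block evicts whatever was there
    return [h / t for h, t in zip(hits, touches)]


if __name__ == "__main__":
    weights = [5, 3, 2]
    print(steady_state_probabilities(weights))
    print(simulate_hit_rates(weights))
```

Because the resident block is simply the most recently accessed one, the hit rate of accesses to block b_j equals the probability that the previous access to this cache block was also to b_j, which is the ratio above.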



Case 1: m < BC and n/m < B. In this case we assume that each data
block contains p = m/(n/B) data pointers (this may be a fraction). There are
two regions in the cache: region R1, which has only data blocks mapped to it,
and region R2, which has both data and count blocks mapped to it. Each cache
block in R1 has r = n/(BC) data blocks mapped to it, whereas each block in
R2 has one count array block and r data array blocks mapped to it. Since each
count array block is accessed with probability B/m, and each data array block
is accessed with probability p/m = B/n, we have that in region R2:

$$\Pr[\text{Count access in } R_2 \text{ is a hit}] = \frac{B/m}{B/m + r \cdot B/n} = \frac{B}{B + m/C} \qquad (1)$$

$$\Pr[\text{Data access in } R_2 \text{ is a hit}] = \frac{B/n}{B/m + r \cdot B/n} = \frac{1}{n/m + r} \qquad (2)$$

Again, each access to the first element in a data block must be a miss, and
we add 1/B compulsory misses. Of the remaining (B - 1)/B fraction of data
array accesses, 1 - m/(BC) go to region R1, where the hit rate for data accesses
is simply 1/r, whereas m/(BC) go to R2, where the hit rate is given by Eqn (2).
Therefore, on average:

$$\#\text{misses}/n = \frac{1}{B} + \left(1 - \frac{B}{B + m/C}\right) + \frac{B-1}{B}\left[\left(1 - \frac{m}{BC}\right)\left(1 - \frac{1}{r}\right) + \frac{m}{BC}\left(1 - \frac{1}{n/m + r}\right)\right] \qquad (3)$$

where r = n/(BC). For Flashsort1, this formula explains data points for 256K
to 1M as shown in Fig. 5.



Case 2: m > BC and n/m < B. We omit the details of this case, which are
similar to Case 1. The final formula is:

$$\#\text{misses}/n = \frac{1}{B} + \left(1 - \frac{BC}{2m}\right) + \frac{B-1}{B} \cdot \left(1 - \frac{BC}{2n}\right) \qquad (4)$$

For Flashsort1, this formula explains data points for 2M to 64M as shown in
Fig. 5.



Input size   256K     512K     1M       2M       4M       8M       16M      32M      64M
Predicted    0.8073   1.1495   1.4106   1.6895   1.8604   1.9458   1.9885   2.0099   2.0206
Simulated    0.7927   1.1695   1.4249   1.7330   1.8931   1.9826   2.0080   2.0191   2.0587

Fig. 5. Predicted and simulated miss rates for permute phase of Flashsort1. The
predicted values include a term 1/(2B) which accounts for misses incurred while
searching for cycle leaders. The justification for this term is deferred to the full
paper.
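The predicted row of Fig. 5 can be reproduced numerically from Eqs. (3) and (4). The sketch below writes out the Case 1 terms from the hit probabilities (1) and (2) and the 1/r hit rate in region R1. Several values are assumptions made here because this section does not state them: B = 16 words per block, C = 8192 cache blocks (so BC = 128K, which matches r = 512 at n = 64M), and m = n/10 classes for Flashsort1. They were chosen because they reproduce the table; the function names are likewise hypothetical.

```python
def case1_misses(n, m, B, C):
    """Eq. (3): misses per key when m < BC and n/m < B."""
    r = n / (B * C)                      # data blocks per cache block
    count_miss = 1 - B / (B + m / C)     # from Eq. (1)
    frac_r2 = m / (B * C)                # share of data accesses going to R2
    data_miss = ((1 - frac_r2) * (1 - 1 / r)
                 + frac_r2 * (1 - 1 / (n / m + r)))   # R2 term from Eq. (2)
    return 1 / B + count_miss + (B - 1) / B * data_miss


def case2_misses(n, m, B, C):
    """Eq. (4): misses per key when m > BC and n/m < B."""
    return 1 / B + (1 - B * C / (2 * m)) + (B - 1) / B * (1 - B * C / (2 * n))


def predicted_flashsort1(n, B=16, C=8192):
    """Predicted permute-phase misses per key, including the 1/(2B) term."""
    m = n / 10                           # assumed number of classes
    if m < B * C:
        base = case1_misses(n, m, B, C)
    else:
        base = case2_misses(n, m, B, C)
    return base + 1 / (2 * B)


# Predicted row of Fig. 5, keyed by input size.
FIG5_PREDICTED = {
    2**18: 0.8073, 2**19: 1.1495, 2**20: 1.4106,
    2**21: 1.6895, 2**22: 1.8604, 2**23: 1.9458,
    2**24: 1.9885, 2**25: 2.0099, 2**26: 2.0206,
}

if __name__ == "__main__":
    for n, published in FIG5_PREDICTED.items():
        print(n, round(predicted_flashsort1(n), 4), published)
```

With these assumed parameters the computed values agree with the published row to within 10^-4, and the switch from Case 1 to Case 2 falls between 1M and 2M, as the text states.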



Case 3: m < BC and n/m ≥ B. In this case we assume that there is
at most one pointer per data block. Again, we divide the cache into region R1,
which has only data blocks mapped to it, and region R2, which has both data and
count blocks mapped to it. We will assume that each block in R2 has p = m/C
active data blocks mapped to it (note that p can be fractional) in addition to
the one count block. This gives Pr[Count access in R2 is a hit] = B/(B + p), and
Pr[Data access in R2 is a hit] = 1/(B + p). This analysis is a bit coarse, but it will
prove accurate enough for our purposes.

It is more interesting to study the hit rate in region R1, which contains only
data pointers. It will be convenient to assume that R1 covers the entire cache,
and has m data pointers mapped to it; as we will see, we can scale down the
values without changing the hit rate. Let the number of pointers mapped to
cache block i be m_i. If cache block i has m_i ≠ 0, then the probability that a
data array access reads cache block i is m_i/m, but the probability of this access
being a hit is 1/m_i. Hence, the probability that an access goes to cache block i
and is a hit is 1/m. Summing over all i such that m_i ≠ 0 gives the overall hit
rate as simply v/m, where v is the number of cache blocks such that m_i ≠ 0.
Assuming that the pointers are independently and uniformly located in cache
blocks, we get the expected value of v as C · (1 - (1 - 1/C)^m) ≈ C · (1 - e^{-p}).
Hence we have:

$$\Pr[\text{Data access in } R_1 \text{ is a hit}] = v/m = \frac{1}{p}\left(1 - e^{-p}\right), \qquad (5)$$

and note that this is invariant to scaling C and m by the same amount. Hence,
on average:

$$\#\text{misses}/n = \frac{1}{B} + \left(1 - \frac{B}{B + p}\right) + \frac{B-1}{B}\left[\left(1 - \frac{m}{BC}\right)\left(1 - \frac{1 - e^{-p}}{p}\right) + \frac{m}{BC}\left(1 - \frac{1}{B + p}\right)\right] \qquad (6)$$

where p = m/C. Using this we predict the misses during the first permute phase
of Flashsort2P below:



Input size   256K    512K    1M      2M      4M      8M      16M     32M     64M
Predicted    0.112   0.119   0.130   0.145   0.165   0.192   0.230   0.279   0.345
Actual       0.076   0.092   0.134   0.128   0.210   0.181   0.332   0.272   0.511
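The predicted row above can be approximated from the Case 3 hit probabilities. The sketch below combines them in the same way as Case 1: 1/B compulsory misses, count accesses with hit rate B/(B + p), and the remaining data accesses split between R1 and R2. This assembly is an assumption made here, as are the parameters B = 16 and C = 8192 and the choice m = 0.5·sqrt(n), which matches the values m = 512 for n = 1M and m = 4096 for n = 64M quoted in the text. The function name is hypothetical.

```python
import math


def flashsort2p_first_permute_misses(n, B=16, C=8192):
    """Predicted misses per key in the first permute phase (Case 3 analysis)."""
    m = 0.5 * math.sqrt(n)               # assumed number of classes
    p = m / C                            # active data blocks per cache block
    count_miss = 1 - B / (B + p)         # count accesses all fall in region R2
    hit_r1 = (1 - math.exp(-p)) / p      # Eq. (5), random pointer placement
    hit_r2 = 1 / (B + p)
    frac_r2 = m / (B * C)                # share of data accesses going to R2
    data_miss = (1 - frac_r2) * (1 - hit_r1) + frac_r2 * (1 - hit_r2)
    cycle_leader_term = 1 / (2 * B)      # term for the search for cycle leaders
    return 1 / B + count_miss + (B - 1) / B * data_miss + cycle_leader_term


# Predicted row of the table above, keyed by input size.
PUBLISHED_PREDICTED = {
    2**18: 0.112, 2**19: 0.119, 2**20: 0.130,
    2**21: 0.145, 2**22: 0.165, 2**23: 0.192,
    2**24: 0.230, 2**25: 0.279, 2**26: 0.345,
}

if __name__ == "__main__":
    for n, published in PUBLISHED_PREDICTED.items():
        print(n, round(flashsort2p_first_permute_misses(n), 4), published)
```

Under these assumptions the computed values stay within about 0.001 of the published predictions. The residual differences suggest that the exact form or parameter rounding used in the paper may differ slightly from this sketch.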



(The predicted values again include a term of 1/(2B) to account for misses
incurred while searching for cycle leaders.) As can be seen, the predictions are
rather inaccurate for n = 1M, 4M, 16M and 64M. This is because the assumption
of random pointer placement is a particularly poor one for these values.
Note that the starting location of the k-th pointer in the data array is
precisely the number of elements in classes 0, ..., k - 1. It is therefore
a binomial random variable with expected value k · n/m and standard deviation
√(n · (k/m) · (1 - k/m)), from which the expected starting location of this
pointer is easily deduced. If n is a multiple of BC then one can reason about this
particularly simply. We view the cache as being ‘continuous’ and place m marks
on the cache numbered 0, 1, ..., m - 1, with the i-th mark being i · (BC)/m words
from the beginning of the cache. The expected location in the cache of the k-th
pointer is easily seen to be mark (kr) mod m, where r = n/(BC). This shows
that the choice of m = 4096 for n = 64M = 2^26 is quite bad, since r = 512,
and the expected starting locations of all pointers are marks numbered 512j, for
integer j. In other words, there are only eight different marks where the pointers'
expected locations lie, and one would expect v to be very small at the outset.

To simplify the analysis, we make the assumption that the value of v does
not change much over the entire course of the permute phase. This is somewhat
plausible: as the algorithm progresses, one would expect all pointers to move at
roughly the same rate, hence maintaining the original pattern of pointers. To
validate this, we performed extensive simulations with the following parameters:
n = 1M, C = 1K, B = 16, and all values of m between 256 and 768 (including
the value m = 512 which would have been chosen by Flashsort2P). The results
are shown in Figure 6, where for each value of the parameters, we plot:




Fig. 6. Exhaustive simulation of various values of m. 



• The average number of misses per key over 5 simulations of the permute
algorithm (‘simulation’ in Figure 6);

• The number of misses per key predicted by Eq. (6) (‘random placement’ in
Figure 6);

• The number of misses per key predicted by Eq. (6) but explicitly calculating
v at the start of the permute phase (‘non-empty blocks’ in Figure 6).

This suggests Eq. (6) is a reasonable predictor for most values of m, and also
that the initial value of v does seem to be a fairly good predictor of the value of
v over the course of the permute phase. However, it appears that it is difficult to
obtain a simple closed-form expression for the initial value of v for arbitrary m.
One way out might be to choose m so that the analysis is made convenient. For
instance, if we choose m to be relatively prime to r, then we know that all the
pointers are expected to start at different marks. Unfortunately, the example of
m = 511 in Fig. 6 shows that there is yet another factor to consider. Notice that
r = 64 in our experiment, and gcd(511, 64) = 1. Although all the pointers are
expected to start at different marks, the miss rate for m = 511 is still quite high.
This is because the expected pointer starting locations are as follows:


mark      0   1   2    3    ...   62    63    64   65   66   ...
pointer   0   8   16   24   ...   496   504   1    9    17   ...
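The mark arithmetic used above is easy to reproduce. The sketch below computes the expected mark (kr) mod m of each pointer, counts how many distinct marks are used, inverts the mapping to obtain the table, and evaluates the binomial standard deviation of a pointer's starting location. The function names are introduced here for illustration only.

```python
import math


def expected_mark(k, r, m):
    """Mark at which pointer k is expected to start: (k * r) mod m."""
    return (k * r) % m


def distinct_marks(r, m):
    """Number of different marks hit by the expected starts of all m pointers."""
    return len({expected_mark(k, r, m) for k in range(m)})


def pointer_at_mark(r, m):
    """Map each used mark to a pointer expected to start there."""
    return {expected_mark(k, r, m): k for k in range(m)}


def start_std_dev(k, n, m):
    """Standard deviation (in words) of the start of pointer k."""
    q = k / m
    return math.sqrt(n * q * (1 - q))


if __name__ == "__main__":
    # n = 64M, m = 4096, r = 512: all pointers share a handful of marks.
    print(distinct_marks(512, 4096))
    # n = 1M, m = 511, r = 64: every pointer has its own mark.
    print(distinct_marks(64, 511))
    table = pointer_at_mark(64, 511)
    print([table[j] for j in (0, 1, 2, 3, 62, 63, 64, 65, 66)])
    # Pointers near the ends move little; those in the middle move a lot.
    print(start_std_dev(1, 2**20, 511), start_std_dev(255, 2**20, 511))
```

For m = 511 the marks are all distinct, yet pointers with neighbouring marks have indices that differ by 8, so the spread of each pointer around its expected mark still matters.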





Observe that pointers with low or high indices have low standard deviation, so
the pointers which will stay close to their expected positions are all clustered
around marks 0, 64, 128, ..., while the pointers which are expected to start around
marks 32, 96, 160, ... all have high standard deviation and can vary considerably
from their expected starting position. Hence there is a concentration of values
again around marks 0, 64, .... Based on this, we propose a heuristic which modifies
m so that the pointers with low and high indices are well-spaced. We do
not describe this heuristic in detail but for powers of 2, simulations show that
this heuristic smooths out the variation seen previously (Fig. 7) and indeed for
arbitrary values of n the heuristic chooses a value of m close to 0.5√n such that
the misses for this value of m are close to what would be predicted by random
placement (Fig. 8).

Fig. 7. Smoothing effect of heuristic when n is a power of 2.

The improvements produced in actual running time by this heuristic are, 
however, not noticeable on our machine. This is mainly because we have physical 
caches and virtually every data array access causes a TLB miss (see footnote 1 
for a definition of a TLB miss) and hence a memory access, so the improvements 
apply to only half the memory accesses. In addition, the cache miss rate is already 
quite low, so waits for memory do not heavily dominate the running time. Our 
rough calculations suggest that the improvement should be at most 5% on our 
machine, but this heuristic should be more useful on a machine with a higher 
cache miss penalty. 



Analysing Cache Effects in Distribution Sorting 197 




Fig. 8. Smoothing effect of heuristic for arbitrary values of n. (The x-axis
shows the number of keys, from 0 to 9e+07.)

6 Conclusions 

We have shown that Flashsort1 is fast when sorting a small number of keys,
but due to poor cache utilisation it starts to perform poorly when the data is
larger than the cache size. We have shown that a 2-pass variant of Flashsort1,
called Flashsort2P, outperforms Flashsort1 and Quicksort for moderate to large
values of n, as it makes far fewer cache misses. We have analyzed the cache
miss rates of the Flashsort variants and can accurately predict the miss rates
in the permute phases of these algorithms. We have also shown that the integer
sorting algorithm MSB radix sort can be used very effectively on floating point
data. The algorithm is very fast due to fast integer operations and relatively
good cache utilisation.

Abstract. We present a new algorithm to search regular expressions,
which is able to skip text characters. The idea is to determine the minimum
length ℓ of a string matching the regular expression, manipulate
the original automaton so that it recognizes all the reverse prefixes of
length up to ℓ of the strings accepted, and use it to skip text characters
as done for exact string matching in previous work. As we show experimentally,
the resulting algorithm is fast, the fastest one in many cases
of interest.



1 Introduction 

The need to search for regular expressions arises in many text-based applications,
such as text retrieval, text editing and computational biology, to name a few.
A regular expression is a generalized pattern composed of (i) basic strings, (ii)
union, concatenation and Kleene closure of other regular expressions. Readers
unfamiliar with the concept and terminology related to regular expressions are
referred to a classical book such as [?].

The traditional technique to search a regular expression of length m in
a text of length n is to convert the expression into a nondeterministic finite
automaton (NFA) with O(m) nodes. Then, it is possible to search the text using
the automaton in O(mn) worst case time. The cost comes from the fact that
more than one state of the NFA may be active at each step, and therefore all
may need to be updated. A more efficient choice [?] is to convert the NFA into a
deterministic finite automaton (DFA), which has only one active state at a time
and therefore allows searching the text at O(n) cost, which is worst-case optimal.
The problem with this approach is that the DFA may have O(2^m) states, which
implies a preprocessing cost and extra space exponential in m.
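The exponential gap between NFA and DFA sizes can be observed on a standard textbook family of expressions, (a|b)*a(a|b)^k, which is used here as an illustration and does not come from the paper. The sketch below builds an NFA with k + 2 states and counts the deterministic states produced by the subset construction. The transition-table representation and both function names are assumptions made for this example.

```python
def nfa_kth_from_end(k):
    """NFA for (a|b)* a (a|b)^k: the (k+1)-th character from the end is 'a'.

    States are 0..k+1, state 0 is initial and state k+1 is final.
    Returns a dict mapping (state, character) to the set of next states.
    """
    trans = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for s in range(1, k + 1):
        trans[(s, "a")] = {s + 1}
        trans[(s, "b")] = {s + 1}
    return trans


def count_dfa_states(trans, initial=0, alphabet="ab"):
    """Count the subsets reachable in the subset construction."""
    start = frozenset({initial})
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for ch in alphabet:
            nxt = frozenset(t for s in current for t in trans.get((s, ch), ()))
            if nxt and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)


if __name__ == "__main__":
    for k in range(6):
        print(k + 2, "NFA states ->", count_dfa_states(nfa_kth_from_end(k)), "DFA states")
```

Each deterministic state has to remember which of the last k + 1 characters were 'a', so the count doubles every time k grows by one while the NFA gains a single state.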

Some techniques have been proposed to obtain a good tradeoff between both
extremes. In 1992, Myers [?] presented a four-russians approach which obtains
O(mn/log n) worst-case time and extra space. The idea is to divide the syntax
tree of the regular expression into “modules”, which are subtrees of a reasonable
size. These subtrees are implemented as DFAs and are thereafter considered as
leaf nodes in the syntax tree. The process continues with this reduced tree until
a single final module is obtained.

The DFA simulation of modules is done using bit-parallelism, which is a
technique to code many elements in the bits of a single computer word and
manage to update all of them in a single operation. In this case, the vector of
active and inactive states is stored as bits of a computer word. Instead of (à la
Thompson [?]) examining the active states one by one, the whole computer word
is used to index a table which, together with the current text character, provides
the new set of active states (another computer word). This can be considered
either as a bit-parallel simulation of an NFA, or as an implementation of a DFA
(where the identifier of each deterministic state is the bit mask as a whole).

Pushing even more in this direction, we may resort to pure bit-parallelism
and forget about the modules. This was done in [?] by Wu and Manber, and
included in their software Agrep [?]. A computer word is used to represent
the active (1) and inactive (0) states of the NFA. If the states are properly
arranged and the Thompson construction is used, all the arrows carry 1's
from bit positions i to i + 1, except for the ε-transitions. Then, a generalization
of Shift-Or (the canonical bit-parallel algorithm for exact string matching)
is presented, where for each text character two steps are performed. First, a
forward step moves all the 1's that can move from a state to the next one,
and second, the ε-transitions are carried out. As ε-transitions follow arbitrary
paths, an E : 2^m → 2^m function is precomputed and stored, where E(w) is
the ε-closure of w. Possible space problems are solved by splitting this table
“horizontally” (i.e. fewer bits per entry) in as many subtables as needed, using the
fact that E(w1 w2) = E(w1) or E(w2). This can be thought of as an alternative
decomposition scheme, instead of Myers' modules.
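For readers unfamiliar with Shift-Or, the following is a minimal sketch of the algorithm for exact string matching, the base case that the approach described above generalizes. It is not the Agrep implementation: the function name is hypothetical, and Python integers are masked to m bits to imitate a computer word. In Shift-Or a 0 bit marks an active state.

```python
def shift_or_search(pattern, text):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has bit i cleared exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones                      # no prefix matched yet
    hits = []
    for pos, ch in enumerate(text):
        # The shift advances every matched prefix by one character and
        # the OR removes those that the current character does not extend.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:   # bit m-1 clear: whole pattern matched
            hits.append(pos - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_or_search("aba", "ababa"))
```

The update costs a constant number of word operations per text character as long as the pattern fits in a machine word, which is what makes the extension to NFAs with ε-closure tables attractive.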

The ideas presented up to now aim at a good implementation of the automaton,
but they must inspect all the text characters. In many cases, however, the
regular expression involves sets of relatively long substrings that must appear
for the regular expression to match. In [?, chapter 5], a multipattern search
algorithm is generalized to regular expression searching, in order to take advantage
of this fact. The resulting algorithm finds all suffixes (of a predetermined
length) of words in the language denoted by the regular expression and uses
the Commentz-Walter algorithm [?] to search them. Another technique of this
kind is used in Gnu Grep v2.0, which extracts a single string (the longest) which
must appear in any match. This string is searched for and the neighborhoods
of its occurrences are checked for complete matches using a lazy deterministic
automaton. Note that it is possible that there is no such single string, in which
case the scheme cannot be applied.

In this paper, we present a new regular expression search algorithm able to
skip text characters. It is based on extending BDM and BNDM [?, ?]. These
are simple pattern search algorithms whose main idea is to build an automaton
able to recognize the reverse prefixes of the pattern, and examine backwards a
window of length m on the text. This automaton helps to determine (i) when
it is possible to shift the window because no pattern substring has been seen,
and (ii) the next position where the window can be placed, i.e. the last time
that a pattern prefix was seen. BNDM is a bit-parallel implementation of this
automaton, faster and much simpler than the traditional version BDM which
makes the automaton deterministic.

Our algorithm for regular expression searching is an extension where, by ma- 
nipulating the original automaton, we search for any reverse prefix of a possible 
match of the regular expression. Hence, this transformed automaton is a compact 
device to achieve the same multipattern searching, at much less space. The au- 
tomata are simulated using bit-parallelism. Our experimental results show that, 
when the regular expression does not match too short or too frequent strings, 
our algorithm is among the fastest, faster than all those unable to skip characters 
and in most cases faster than those based on multipattern matching. An extra 
contribution is our bit-parallel simulation, which differs from Agrep’s in that 
less bits are used and no e-transitions exist, although the transitions on letters 
are arbitrary and therefore a separate table per letter is needed (the tables can 
be horizontally split in case of space problems) . Our simulation turns out to be 
faster than Agrep and the fastest in most cases. 

Some definitions that are used in this paper follow. A word is a string or
sequence of characters over a finite alphabet Σ. A word x ∈ Σ* is a factor (or
substring) of p if p can be written p = uxv, u, v ∈ Σ*. A factor x of p is called
a suffix (resp. prefix) of p if p = ux (resp. p = xu), u ∈ Σ*. We call R our
pattern (a regular expression), which is of length m. We note L(R) the set of
words generated by R. Our text is of size n.

We define also the language to denote regular expressions. Union is denoted
with the infix sign “|”, Kleene closure with the postfix sign “*”, and concatenation
simply by putting the sub-expressions one after the other. Parentheses
are used to change the precedence, which is normally “*”, concatenation, “|”.
We adapt some widely used extensions: [c1...ck] (where the ci are characters) is
a shorthand for (c1|...|ck). Instead of a character c, a range c1-c2 can be specified
to avoid enumerating all the letters between (and including) c1 and c2. Finally,
the period (.) represents any character.
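The bracket shorthand can be illustrated with a small expansion routine. This is a sketch under simplifying assumptions: the function name is hypothetical, the argument is the text between the brackets, and neither escaping of metacharacters nor negated classes are handled.

```python
def expand_class(body):
    """Expand the inside of [...] into the equivalent union (c1|...|ck).

    A run such as 'a-d' is treated as a range covering every character
    from 'a' to 'd' inclusive; any other character stands for itself.
    """
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            low, high = body[i], body[i + 2]
            chars.extend(chr(code) for code in range(ord(low), ord(high) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return "(" + "|".join(chars) + ")"


if __name__ == "__main__":
    print(expand_class("a-cx"))
    print(expand_class("0-3"))
```

Expanding classes this way keeps the core syntax limited to union, concatenation and closure, at the cost of a longer expression.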



2 The Reverse Factor Search Approach 

In this section we describe the general reverse factor search approach currently
used for a single pattern or multiple patterns.

The search is done using a window which has the length of the minimum
word that we search (if we search a single word, we just take its length). We
note this minimum length ℓ.

We shift the window along the text, and for each position of the window,
we search backwards (i.e. from right to left, see Figure 1) for any factor of any
length-ℓ prefix of our set of patterns (if we search a single word, this means
any factor of the word). Also, each time we recognize a factor which is indeed
a prefix of some of the patterns, we store the window position in a variable last
(which is overwritten, so we know the last time that this happened). Now, two
possibilities appear:

(i) We do not reach the beginning of the window. This case is shown in Figure 1.
The search for a factor fails on a letter σ, i.e. σu is not a factor of a length-ℓ
prefix of any pattern. We can directly shift the window to start at position
last, since no pattern can start before, and begin the search again.

(ii) We reach the beginning of the window. If we search just one pattern, then
we have recognized it and we report the occurrence. Otherwise, we just
recognized a length-ℓ prefix of one or more patterns. We verify directly in
the text if there is a match of a pattern, with a forward (i.e. left to right)
scan. This can be done with a trie of the patterns. Next, in both cases, we
shift the window according to position last.





[Figure: two panels. Top: the window is searched backwards for a factor, and
last records the window position whenever a prefix of any pattern is recognized.
Bottom: the search for a factor fails on σ; the maximum prefix starts at last,
which gives the safe shift to the new window.]

Fig. 1. The reverse factor search approach.



This simple approach leads to very fast algorithms in practice, such as BDM [?]
and BNDM [?]. For a single pattern, this is optimal on average, matching Yao's
bound [?] of O(n log(ℓ)/ℓ) (where n is the text size and ℓ the pattern length).
In the worst case, this scheme is quadratic (O(nℓ) complexity). There exists
however a general technique to keep the algorithms sub-linear on average and
linear in the worst case.
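For a single pattern, the window scan and the role of last can be sketched as follows. The code follows the usual bit-parallel formulation of BNDM, with Python integers masked to m bits standing in for a computer word. It is an illustrative sketch written for this text, not the authors' code, and the function name is an assumption.

```python
def bndm_search(pattern, text):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    full = (1 << m) - 1
    top = 1 << (m - 1)
    # masks[c] has bit (m - 1 - i) set exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << (m - 1 - i))
    hits = []
    pos = 0
    while pos <= n - m:
        j, last, state = m, m, full       # every factor is still possible
        while state:
            # Read the window backwards, one character per iteration.
            state &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if state & top:               # the suffix read so far is a pattern prefix
                if j > 0:
                    last = j              # remember where that prefix starts
                else:
                    hits.append(pos)      # the whole window equals the pattern
            state = (state << 1) & full
        pos += last                       # safe shift
    return hits


if __name__ == "__main__":
    print(bndm_search("abra", "abracadabra"))
```

When the character read does not occur in the pattern, the state becomes zero at once and the window jumps by its full length, which is where the sub-linear average behaviour comes from.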



2.1 A Linear Worst Case Algorithm 

The main idea used in [?] is to avoid retraversing the same characters in
the backward window verification. We divide the work done on the text in two
parts: forward and backward scanning. To be linear in the worst case, none of
these two parts must retraverse characters. In the forward scan, it is enough to
keep track of the longest pattern prefix v that matches the current text suffix.
This is easily achieved with a KMP automaton [?] (for one pattern) or an
Aho-Corasick automaton [?] (for multiple patterns). All the matches are found using
the forward scan.

However, we need to use also backward searching in order to skip characters.
The idea is that the window is placed so that the current longest prefix matched
v is aligned with the beginning of the window. The position of the current text
character inside the window (i.e. |v|) is called the critical position. At any point
in the forward scan we can place the window (shifted |v| characters from the
current text position) and try a backward search. Clearly, this is only promising
when v is not very long compared to ℓ. Usually, a backward scan is attempted
when the prefix is shorter than ⌊ℓ/α⌋, where 0 < α < ℓ is fixed arbitrarily (usually
α = 2).

The backward search proceeds almost as before, but it finishes as soon as the 
critical position is reached. The two possibilities are: 

(i) We reach the critical position. This case is shown in Figure 2. In this case
we are not able to skip characters. The forward search is resumed in the
place where it was left (i.e. from the critical position), totally retraverses the
window, and continues until the condition to try a new backward scan holds
again.



[Figure: the backward search reaches the critical position (Critpos) at the end
of the prefix v; the window is re-read with a forward search; at the end of the
forward search we are back to a normal state, with a new prefix v', a new
critical position Critpos' and a new window.]

Fig. 2. The critical position is reached, in the linear-time algorithm.



(ii) We do not reach the critical position. This case is shown in Figure 3. This
means that there cannot be a match in the current window. We start a
forward scan from scratch at position last, totally retraverse the window,
and continue until a new backward scan seems promising.

[Figure: the search for a factor fails on σ; a safe shift moves to last, from
where a forward search starts; at the end of the forward search we are back to
the current situation, with a new critical position Critpos' and a new window.]

Fig. 3. The critical position is not reached, in the linear-time algorithm.



3 Extending the Approach to Regular Expression 
Searching 

In this section we explain how to adapt the general approach of Section 2 to
regular expression searching. We first explain a simple extension of the basic
approach and later show how to keep the worst case linear. Recall that we search
for a regular expression called R which is of size m and generates the language
L(R).

3.1 Basic Approach 

The search in the general approach needs a window of length ℓ (shortest pattern
we search). In regular expression searching this corresponds to the length of the
shortest word of L(R). Of course, if this word is ε, the problem of searching is
trivial since every text position matches. We consider in the rest of the paper
that ℓ > 0.

We use the general approach of Section 2, consisting of a backward and, if
necessary (i.e. we reached the beginning of the window), a forward scan. To adapt
this scheme to regular expression search, we need two modifications:

(i) The backward search step of the general approach imposes here that we are
able to recognize any factor of the reverse prefixes of length ℓ of L(R). Moreover,
we mark in a variable last the longest prefix of L(R) recognized (of course
this prefix will be of length less than ℓ).

(ii) The forward search (if we reached the beginning of the window) verifies that
there is a match of the regular expression starting at the beginning of the
window (however, the match can be much longer than ℓ).

We detail now the steps of the preprocessing and searching phases. Complexities
will be discussed in a later section because they are related to the way the
automata are built.

Preprocessing The preprocessing consists of 3 steps: 

1. Build the automaton that recognizes R. We note it F{R), and its specific 
construction details are deferred to the next section. 

2. Determine £ and compute the set Pi{R) of all the nodes of F{R) reachable in 
i steps or less from the initial state, for each 0 < i ^ (so Pi{R) C Pij^i{R)). 
Both things are easily computed with a breadth-first search from the initial 
state until a final node is reached (being then £ the current depth at that 
point). 

3. Build the automaton B{R) that recognizes any factor of the reverse prefixes 
of length £ of L{R). This is achieved by restricting the original automa- 
ton F{R) to the nodes of Pi{R), reversing the arrows, taking as (the only) 
terminal state the initial state of F{R) and all the states as initial states. 

The most interesting part of the above procedure is B(R), which is a device
to recognize the reverse factors of prefixes of length ℓ of L(R). It is not hard
to see that any such factor corresponds to a path in F(R) that touches only
nodes in P_ℓ(R). In B(R) there exists the same path with the arrows reversed,
and since all the states of B(R) are initial, there exists a path from an initial
state that spells out the reversed factor. Moreover, if the factor is a prefix, then
the corresponding path in B(R) leads to its final state.

Note, however, that B(R) can recognize more words than desired. For instance,
if there are loops in B(R) it can recognize words longer than ℓ. However,
we can further restrict the set of words recognized by B(R). The idea is that, if
a state of B(R) is active but it is farther than i positions from the final state of
B(R), and only i window characters remain to be read, then this state cannot
lead to a match. Hence, if we have to read i more characters of the window, we
intersect the current active states of B(R) with the set P_i(R).

It is easy to see that, with this modification, the automaton recognizes exactly
the desired prefixes, since if a state has not been “killed” by the intersection
with P_i(R) it is because it is still possible to obtain a useful prefix from it. Hence,
only the desired (reverse) factors can survive the whole process until they arrive
at the final state and become (reverse) prefixes.

In fact, an alternative method in this sense would be a classical multi-pattern
algorithm to recognize the reverse factors of the set of prefixes of length ℓ of L(R).
However, this set may be large and the resulting scheme may need much more
memory. The automaton B(R) is a more compact device to obtain the same
result.
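The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes an ε-free NFA with at most 32 states whose state 0 is initial, and the names Nfa, Edge, Preprocessed and preprocess are hypothetical. For simplicity the breadth-first search visits every state instead of stopping at the first final one; the resulting ℓ and P_i(R) are the same.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

using Mask = std::uint32_t;  // bit s set <=> state s belongs to the set

struct Edge { int from; char label; int to; };

struct Nfa {                 // epsilon-free automaton F(R); state 0 is initial
    int numStates;           // at most 32 states in this sketch
    Mask finals;
    std::vector<Edge> edges;
};

struct Preprocessed {
    int ell = -1;                // length of the shortest word of L(R); -1 if none
    std::vector<Mask> P;         // P[i] = states reachable in at most i steps
    std::vector<Edge> reversed;  // arrows of B(R): edges inside P[ell], reversed
};

Preprocessed preprocess(const Nfa& nfa) {
    // Breadth-first search from the initial state gives each state's depth.
    std::vector<int> depth(nfa.numStates, -1);
    std::queue<int> pending;
    depth[0] = 0;
    pending.push(0);
    while (!pending.empty()) {
        int s = pending.front();
        pending.pop();
        for (const Edge& e : nfa.edges)
            if (e.from == s && depth[e.to] < 0) {
                depth[e.to] = depth[s] + 1;
                pending.push(e.to);
            }
    }

    // ell is the depth of the shallowest reachable final state.
    Preprocessed out;
    for (int s = 0; s < nfa.numStates; ++s)
        if (((nfa.finals >> s) & 1u) && depth[s] >= 0 &&
            (out.ell < 0 || depth[s] < out.ell))
            out.ell = depth[s];
    if (out.ell < 0) return out;  // empty language

    // A state of depth d belongs to P[d], P[d+1], ..., P[ell].
    out.P.assign(out.ell + 1, 0);
    for (int s = 0; s < nfa.numStates; ++s)
        for (int i = depth[s]; i >= 0 && i <= out.ell; ++i)
            out.P[i] |= Mask(1) << s;

    // B(R): keep only edges between states of P[ell] and reverse them.
    for (const Edge& e : nfa.edges)
        if (((out.P[out.ell] >> e.from) & 1u) && ((out.P[out.ell] >> e.to) & 1u))
            out.reversed.push_back({e.to, e.label, e.from});
    return out;
}
```

In B(R) every state of P[ell] is initial and state 0 is the only terminal state, so no extra data is needed beyond the reversed edge list and the masks.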

Searching The search follows the general approach described earlier. For each
window position, we activate all the states of B(R) and traverse the window
backwards, updating last each time the final state of B(R) is reached (recall that
after each step, we “kill” some states of B(R) using P_i(R)). If B(R) runs out of
active states we shift the window to position last. Otherwise, if we reached the
beginning of the window, we start a forward scan using F(R) from the beginning
of the window until either a match is found, we reach the end of the text, or
F(R) runs out of active states (since we report the beginning of matches, we stop
the forward search as soon as we find a match). After the forward scan, we shift
the window to position last.



3.2 Linear Worst Case Extension

We also extended the general linear worst-case approach to the case of regular
expression searching.

We transform the forward scan automaton F(R) of the previous algorithm
by adding a self-loop at its initial state, for each letter of Σ (so now it recognizes
Σ*L(R)). This is the normal automaton used for classical searching, and the one
we use for the forward scanning.

The main difficulty in extending the general linear approach is where to place
the window in order not to lose a match. The general approach considers the
longest prefix of the pattern already recognized. However, this information cannot
be inferred only from the active states of the automaton (for instance, it is
not known how many times we have traversed a loop). We use an alternative
concept: instead of considering the longest prefix already matched, we consider
the shortest path to reach a final state. This value can be determined from the
current set of states. We devise two different alternatives that differ in the use
of this information.

Prior to explaining both alternatives, we introduce some notation. In general,
the window is placed so that it finishes ℓ′ characters ahead of the current text
position (for 0 ≤ ℓ′ ≤ ℓ). To simplify our explanation, we call this ℓ′ the
“forward-length” of the window.

In the first alternative the forward-length of the window is the shortest path
from an active state of F(R) to a final state (this same idea has been used for
multipattern matching). In this case, we need to recognize any reverse
factor of L(R) in the backward scan (not only the factors of prefixes of length ℓ).
A stricter choice is to recognize any reverse factor of any word of length ℓ′ that
starts at an active state in F(R), but this needs much more space and
preprocessing time.

Each time ℓ′ is large enough to be promising (ℓ′ ≥ αℓ for some heuristically
fixed α), we stop the forward scan and start a backward scan on a window of
forward-length ℓ′ (the critical position being ℓ − ℓ′). If the backward automaton
runs out of active states before reaching the critical position, we shift the window
as in the general scheme (using the last prefix found) and restart a fresh forward
scan. Otherwise, we continue the previous forward scan from the critical position,
totally traversing the window and continuing until the condition to start a
backward scan holds again.

The previous approach is linear in the worst case (since each text position
is scanned at most once forward and at most once backwards), and it is able to
skip characters. However, a problem is that all the reverse factors of L(R) have
to be recognized, which makes the backward scans longer and the shifts shorter.
Also, the window forward-length ℓ′ is never larger than our previous ℓ, since the
initial state of F(R) is always active.

The second alternative solves some of these problems. The idea now is that
we continue the forward scan until all the active states belong to P_i(R), for
some fixed i < ℓ (say, i = ℓ/2). In this case, the forward-length of the window
is ℓ′ = ℓ − i, since no match can end before that many more characters have
been read. Again, we select heuristically a minimum ℓ′ = αℓ value. In this
case, we do not need to recognize all the factors. Instead, we can use the already
known B(R) automaton. Note that the previous approach applied to this case
(with all active states belonging to P_i(R)) yields different results. In this case
we limit the set of factors to recognize, which allows us to shift the window sooner.
On the other hand, its forward-length is shorter.
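The basic backward search of this section (not the two linear variants) can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the NFA is ε-free with a handful of states and state 0 initial, state sets are bit masks, and a full table indexed by character and state set drives both automata. The names Edge, Table, buildTable and backwardSearch are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Mask = std::uint32_t;                    // bit s set <=> state s is active
struct Edge { int from; unsigned char label; int to; };
using Table = std::vector<std::vector<Mask>>;  // table[c][D] = states reached from D on c

// Transition table over all state sets; only sensible for a handful of states.
Table buildTable(int numStates, const std::vector<Edge>& edges) {
    Table table(256, std::vector<Mask>(std::size_t(1) << numStates, 0));
    for (const Edge& e : edges)
        table[e.label][Mask(1) << e.from] |= Mask(1) << e.to;
    for (std::vector<Mask>& row : table)
        for (Mask D = 1; D < row.size(); ++D) {
            Mask rest = D & (D - 1);           // D without its lowest set bit
            if (rest != 0) row[D] = row[rest] | row[D ^ rest];
        }
    return table;
}

// Returns the text positions where a match of the automaton starts.
std::vector<std::size_t> backwardSearch(int numStates, const std::vector<Edge>& edges,
                                        Mask finals, const std::string& text) {
    const Table fwd = buildTable(numStates, edges);

    // P[i] = states reachable in at most i steps; stop at the first final state.
    std::vector<Mask> P{1};
    while ((P.back() & finals) == 0) {
        Mask next = P.back();
        for (const std::vector<Mask>& row : fwd) next |= row[P.back()];
        if (next == P.back()) return {};       // no final state is reachable
        P.push_back(next);
    }
    const std::size_t ell = P.size() - 1;
    if (ell == 0) return {};                   // the text assumes ell > 0

    // B(R): arrows reversed, restricted to the states of P[ell].
    std::vector<Edge> reversed;
    for (const Edge& e : edges)
        if (((P[ell] >> e.from) & 1u) && ((P[ell] >> e.to) & 1u))
            reversed.push_back({e.to, e.label, e.from});
    const Table bwd = buildTable(numStates, reversed);

    std::vector<std::size_t> starts;
    std::size_t pos = 0;                       // window is text[pos .. pos+ell-1]
    while (pos + ell <= text.size()) {
        Mask D = P[ell];                       // all states of B(R) start active
        std::size_t j = ell, last = ell;
        while (D != 0 && j > 0) {
            unsigned char c = static_cast<unsigned char>(text[pos + j - 1]);
            --j;                               // j window characters remain unread
            D = bwd[c][D] & P[j];              // kill states too far from state 0
            if ((D & 1u) && j > 0) last = j;   // a prefix starts at window offset j
        }
        if (D & 1u) {                          // window start reached: verify with F(R)
            Mask F = 1;
            for (std::size_t k = pos; k < text.size() && F != 0; ++k) {
                F = fwd[static_cast<unsigned char>(text[k])][F];
                if (F & finals) { starts.push_back(pos); break; }
            }
        }
        pos += last;
    }
    return starts;
}
```

The table has 256 × 2^numStates entries, so this layout is only meant for the short patterns used as examples; the shift by last is what lets the scan skip text characters.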



4 Building an NFA from a Regular Expression 



There exist currently many different techniques to build an NFA from a regular
expression R of size m. The most classical one is the Thompson construction.
It builds an NFA with at most 2m states that present some particular properties.
Some algorithms, like those of Myers and of Wu and Manber in Agrep,
make use of these properties.

A second one is Glushkov's construction, popularized by Berry and Sethi.
The NFA resulting from this construction has the advantage of having just
m + 1 states (one per position in the regular expression). A lot of research on
Glushkov's construction has been pursued; for instance, it has been shown that
the resulting NFA is quadratic in the number of edges in the worst case. A
long-standing open question about the minimal number of edges of an NFA with
a linear number of states has also been answered, showing a construction with
O(m) states and O(m (log m)^2) edges, as well as a lower bound of Ω(m log m)
edges. Hence, Glushkov's construction is not space-optimal.

Some research has also been done on constructing a DFA directly from a
regular expression, without constructing an NFA.

For our purpose, when we consider bit-parallelism, the most interesting is to
have a minimal number of states, because we manage computer words of a fixed
length w to represent the set of possible states. Hence, we choose the original
Glushkov construction, which leads to an NFA with m + 1 states and a quadratic
(in the worst case) number of edges. The number of edges is unimportant in
our case.

In Glushkov's construction, the edges have no simple structure, and we need
a table which, for each current set of states and each current text character,
gives the new set of states. On the other hand, the construction of Wu and
Manber uses the regularities of Thompson's construction so that they need only
a table for the ε-transitions, not for every character. In exchange, we have m + 1
states instead of nearly 2m states, and hence their table sizes square ours. As
we show later experimentally, our NFA simulation is faster than those based on
the Thompson construction, so the tradeoff pays off.
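As a rough size check on this trade-off, and assuming (our simplification, not a figure from the experiments) that a table has one entry per set of states while the per-character factor is ignored, the number of state sets compares as follows:

```latex
\[
  \underbrace{2^{\,m+1}}_{\text{Glushkov, } m+1 \text{ states}}
  \qquad\text{versus}\qquad
  \underbrace{2^{\,2m}}_{\text{Thompson, nearly } 2m \text{ states}}
  \;=\; \bigl(2^{m}\bigr)^{2},
\]
\[
  \text{e.g. } m = 8:\qquad 2^{9} = 512 \qquad\text{versus}\qquad 2^{16} = 65536 .
\]
```

Doubling the number of states squares the number of state sets, which is the sense in which the Thompson-based tables square the Glushkov-based ones.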



5 Experimental Results 

We compare in this section our approach against previous work. We divide this 
comparison in three parts. First, we compare different existing algorithms to 
implement an automaton. These algorithms process all the text characters, one 
by one, and they only differ in the way they keep track of the state of the search. 
The goal of this comparison is just to show that our simulation is competitive. 
Second, we compare, using our automaton simulation, a simple forward-scan 
algorithm against the different variants of backward search proposed, to show 
that backward searching is faster in general. Finally, we compare our backward 
search algorithm against other algorithms that are also able to skip characters. 

We use an English text (writings of B. Franklin), filtered to lower-case and
replicated until obtaining 10 Mb. A major problem when presenting experiments
on regular expressions is that there is no concept of a “random” regular
expression, so it is not possible to search, say, 1,000 random patterns. Lacking
such a choice, we fixed a set of 10 patterns, which were selected to illustrate
different interesting cases rather than more or less “probable” cases. In fact we
believe that common patterns have long exact strings and our algorithm would
behave even better than in these experiments. Therefore, the goal is not to show
what the typical cases in practice are but to show how the scheme behaves under
different characteristics of the pattern.

The patterns are given in Table 1. We also show their number of letters,
which is closely related to the size of the automata recognizing them, the minimum
length of a match for each pattern, and their empirical matching probability
(number of matches divided by n). The period (.) in the patterns matches
any character except the end of line (lines have approximately 70 characters).



No.  Pattern                    Size         Minimum   Prob. match
                                (# letters)  length ℓ  (empirical)
  1  benjamin|franklin          16           8         .00003586
  2  benjamin|franklin|writing  23           7         .0001014
  3  [a-z][a-z0-9]*[a-z]        3            2         .6092
  4  benj.*min                  8            7         .000007915
  5  [a-z][a-z][a-z][a-z][a-z]  5            5         .2024
  6  (benj.*min)|(fra.*lin)     15           6         .00003586
  7  ben(a|(j|a)*)min           9            6         .009491
  8  be.*ja.*in                 8            6         .00001211
  9  ben[jl]amin                8            8         .000007915
 10  (be|fr)(nj|an)(am|kl)in    14           8         .00003586

Table 1. The patterns used on English text.






Our machine is a Sun UltraSparc-1 of 167 MHz, with 64 Mb of RAM, running 
Solaris 2.5.1. We measured CPU times in seconds, averaging 10 runs over the 10 
Mb (the variance was very low). We include the time for preprocessing in the 
figures. 

5.1 Forward Scan Algorithms 

In principle, any forward scan algorithm can be enriched with backward searching
to skip characters. Some are easier to adapt than others, however. In this
experiment we only consider the performance of the forward scan method we
adopted. The purpose of this test is to show that our approach is competitive
against the rest. We have tested the following algorithms for the forward scanning
(the implementations are ours unless otherwise stated). See the Introduction for
detailed descriptions of previous work.

DFA: builds the classical deterministic automaton and runs it over the text.
We have not minimized the automaton.

Thompson: simulates the nondeterministic automaton by keeping a list of active
states which is updated for each character read (this does not mean that
we build the automaton using Thompson's method).

BP-Thompson: same as before, but the set of active states is kept as a bit
vector. Set manipulation is faster when many states are active.

Agrep: uses a bit mask to handle the active states. The software is
from S. Wu and U. Manber, and has an advantage on frequently occurring
patterns because it abandons a line as soon as it finds the pattern on it.

Myers: is the algorithm based on modules implemented as DFAs. The code
is from G. Myers.

Ours: our forward algorithm, similar to that of Agrep except that we eliminate
the ε-transitions and have a separate transition table for each character
(Section 4).

Except for Agrep and Myers, which have their own code, we use the NFA
construction of Section 4. Table 2 shows the results on the different patterns. As
can be seen, the schemes that rely on nondeterministic simulation (Thompson
variants) worsen when the combination of pattern size and matching probability
increases. The rest are basically insensitive to the pattern, except that all
worsen a little when the pattern matches very frequently. If the pattern gets
significantly larger, however, the deterministic simulations worsen as well, as
some of them depend even exponentially on the automaton size. Agrep,
Myers and Ours can adapt at higher but reasonable costs, proportional to the
pattern length. This comes not only from the possible need to use many machine
words but also because it may be necessary to cut the tables horizontally.

With respect to the comparison, we point out that our scheme is competitive,
being the fastest in many cases, and always at most 5% over the performance
of the fastest. DFA is the best in the other cases. Our algorithm can in fact be
seen as a DFA implementation, where our state identifier is the bit mask and the
transition table is the one we use. However, the DFA has fewer states, since most
of the bit combinations we store are in fact unreachable (we do not build the
transition table for unreachable states, but we do not compact reachable states
into consecutive table positions as the DFA implementation does; this is the
essence of the bit-parallel implementation). On the other hand, the bit-parallel
implementation is much more flexible when it comes to adapting it for backward
searching or extending it to handle extended patterns or to allow errors.

Pattern  DFA   Thompson  BP-Thompson  Agrep  Myers  Ours  Ours/best
   1     0.70  4.47      4.19         1.88   5.00   0.68  1.00
   2     0.73  4.13      5.10         1.89   8.57   0.76  1.04
   3     1.01  18.2      3.75         0.98   2.19   0.99  1.01
   4     0.71  4.16      3.17         0.97   2.17   0.68  1.00
   5     0.87  18.7      4.32         1.05   2.18   0.82  1.00
   6     0.76  4.25      4.06         1.87   4.94   0.72  1.00
   7     0.73  4.67      2.82         0.99   2.17   0.72  1.00
   8     0.72  4.93      3.40         0.96   2.18   0.73  1.01
   9     0.66  4.75      3.11         1.00   2.16   0.68  1.03
  10     0.71  4.36      3.97         1.86   5.01   0.73  1.03

Table 2. Forward search times on English, in seconds for 10 Mb.

We have left aside lazy deterministic automata implementations. However,
as we show in Section 5.3, these also tend to be slower than ours.

5.2 Forward Versus Backward Scanning

We now compare our new forward scan algorithm (called Fwd in this section
and Ours in Section 5.1) against backward scanning. There are three backward
scanning algorithms. The simplest one, presented in Section 3.1, is called Bwd.
The two linear variations presented in Section 3.2 are called LBwd-All (which
recognizes all the reverse factors) and LBwd-Pref (which recognizes reverse
factors of length-ℓ prefixes). The linear variations depend on an α parameter,
which is between 0 and 1. We have tested the values 0.25, 0.50 and 0.75 for α,
although the results change little.

Table 3 shows the results. We obtained improvements in 7 of the 10 patterns
(very impressive ones in four cases). In general, the linear versions are quite bad
in comparison with the simple one, although in some cases they are faster than
forward searching. It is difficult to determine which of the two versions is better
in which cases, and which is the best value for α.

                     LBwd-All                LBwd-Pref
Pattern  Fwd   Bwd   α=0.25  α=0.50  α=0.75  α=0.25  α=0.50  α=0.75
   1     0.68  0.28  0.44    0.43    0.46    0.47    0.49    0.50
   2     0.76  0.65  1.17    1.00    1.09    0.93    0.98    0.95
   3     0.99  2.37  3.30    2.59    3.01    2.56    2.56    2.56
   4     0.68  0.56  1.70    1.68    1.71    0.94    0.93    0.92
   5     0.82  2.02  2.05    2.40    2.09    2.13    2.15    2.18
   6     0.72  0.70  1.82    1.85    1.84    1.10    1.12    1.09
   7     0.72  0.30  0.46    0.45    0.47    0.51    0.52    0.51
   8     0.73  0.91  1.75    1.85    1.87    1.33    1.45    1.47
   9     0.68  0.24  0.37    0.37    0.39    0.41    0.39    0.41
  10     0.73  0.29  0.42    0.45    0.44    0.47    0.46    0.48

Table 3. Backward search times on English, in seconds for 10 Mb.

5.3 Character Skipping Algorithms

Finally, we consider other algorithms able to skip characters. Basically, the
other algorithms are based on extracting one or more strings from the regular
expression, so that some of those strings must appear in any match. A single-
or multi-pattern exact search algorithm is then used as a filter, and only where
some string in the set is found is its neighborhood checked for an occurrence of
the whole regular expression. Two approaches exist:

Single pattern: one string is extracted from the regular expression, so that the
string must appear inside every match. If this is not possible the scheme cannot
be applied. We use GNU Grep v2.3, which implements this idea. Where
the filter cannot be applied, Grep uses a forward scanning algorithm which
is 30% slower than ours (which shows that our implementation is faster than
a good lazy deterministic automaton implementation). Hence, we plot this
value only where the idea can be applied. We point out that Grep also
abandons a line when it finds a first match in it.

Multiple pattern: this idea has been presented in previous work. A length ℓ′ is
selected, and all the possible suffixes of length ℓ′ of L(R) are generated and
searched for. The choice of ℓ′ is not obvious, since longer strings make the
search faster, but there are more of them. Unfortunately, the code of that work
is not public, so we have used the following procedure: first, we extract by
hand the suffixes of length ℓ′ for each regular expression; then we use the
multipattern search of Agrep, which is very fast, to search those suffixes; and
finally the matching lines are sent to Grep, which checks the occurrence of
the regular expression in the matching lines. We find by hand the best ℓ′
value for each regular expression. The resulting algorithm is quite similar to
that idea.

Our algorithms are called Fwd and Bwd and correspond to those of the
previous sections. Table 4 shows the results. The single pattern filter is a very
effective trick, but it can be applied only in a restricted set of cases. In some
cases its improvement over our backward search is modest. The multipattern
filter, on the other hand, is more general, but its times are higher than ours in
general, especially where backward searching is better than forward searching
(an exception is the 2nd pattern, where we have a costly preprocessing).



Pattern  Fwd   Bwd   Single pattern  Multipattern
                     filter          filter
   1     0.68  0.28  -               0.31
   2     0.76  0.65  -               0.37
   3     0.99  2.37  -               1.65
   4     0.68  0.56  0.17            0.87
   5     0.82  2.02  -               2.02
   6     0.72  0.70  -               1.00
   7     0.72  0.30  0.26            0.44
   8     0.73  0.91  0.63            0.66
   9     0.68  0.24  0.19            0.31
  10     0.73  0.28  0.98            0.35

Table 4. Algorithm comparison on English, in seconds for 10 Mb.



6 Conclusions 

We have presented a new algorithm for regular expression searching able to 
skip characters. It is based on an extension of the backward DAWG matching 
approach, where the automaton is manipulated to recognize reverse prefixes of 
strings of the language. We also presented two more complex variants which 
are of linear time in the worst case. The automaton is simulated using bit- 
parallelism. 

We first show that the bit-parallel implementation is competitive (at most 
5% over the fastest one in all cases). The advantage of bit-parallelism is that the 
algorithm can easily handle extended patterns, such as classes of characters, wild 
cards and even approximate searching. We then compare the backward matching 
against the classical forward one, finding out that the former is superior when 
the minimum length of a match is not too short and the matching probability is 
not too high. 

Finally, we compare our approach against others able to skip characters. 
These are based on filtering the search using multipattern matching. The ex- 
periments show that our approach is faster in many cases, although there exist 
some faster hybrid algorithms which can be applied in some restricted cases. 
Our approach is more general and performs reasonably well in all cases. 

The preprocessing time is a subject of future work. In our experiments the
patterns were reasonably short and the simple technique of using one transition
table was the best choice. However, longer patterns would need the use
of the table splitting technique, which worsens the search times. More work on
minimizing the NFA could improve the average case.

Being able to skip characters, and being based on an easily generalizable
technique such as bit-parallelism, our scheme can be extended to deal with other
cases, such as searching a regular expression allowing errors, while still being
able to skip characters. This is also a subject of future work.



Abstract. A common paradigm in data structures is to combine two different
kinds of data structures into one, yielding a hybrid data structure
with improved resource bounds. We perform an experimental evaluation
of hybrid data structures in the context of maintaining a dynamic ordered
set whose items have integer or floating-point keys. Among other
things we demonstrate clear speedups over library implementations of
search trees, both for predecessor queries and updates. Our implementations
use very little extra memory compared to search trees, and are
also quite generic.



1 Introduction 



Solutions to data structuring problems often involve trade-offs: for example, some
problems can be solved faster if the data structure is allowed more space, while
in other problems faster updates may come at the expense of slower queries.
In some cases, two different data structures for the same problem may lie on
opposite extremes of the trade-off, and it may be possible to combine these data
structures into a hybrid data structure which outperforms each individual data
structure. This principle has been used in the theoretical data structure
literature, in external-memory data structures, and in practical implementations
of indices for text searching.

We study hybrid data structures in the context of the dynamic predecessor
problem, which is to maintain a set S of keys drawn from an ordered universe U
under the operations of insertion, deletion and the predecessor query operation,
which, given some x ∈ U as input, returns pred(x, S) = max{y ∈ S | y ≤ x}.
The dynamic predecessor problem is a key step in a number of applications, and
most libraries of data structures provide an ADT for this problem, e.g. LEDA
provides the sorted sequence ADT and STL provides the set ADT, which
has similar functionality. These library implementations are based on balanced
search trees or skip lists and obtain information about the relative order
of two elements of U only through pairwise comparisons. However, additional
knowledge about the universe U can lead to asymptotically more efficient data
structures than would be possible in the comparison-based framework.

For example, a number of data structures for the RAM model have been
proposed for this problem when the keys are integers. Some of these achieve
O(√(log n)) query and update times. Although
these algorithms appear to be too complex to be competitive, there is a simple
alternative, the digital search tree or trie [p. 492ff]. There are many papers,
both theoretical and experimental, that deal with tries, but
these mostly focus on the data type of variable-length strings. Also, the emphasis
is mostly on dictionary queries (given a key x, does it belong to S?), which are
easier than predecessor queries in the context of tries (a few recent papers also
deal with a kind of prefix-matching problem which appears in IP routers). A
common conclusion from these papers is that tries are slow to update and are
memory-hungry. On the other hand, balanced search trees are (relatively) slow
to search but are quick to update once the item has been located. For example,
red-black trees and skip lists take O(1) amortized or expected time to perform an
update if the location of the update is known. Since the strengths of tries and
search trees are complementary, one could try to combine the two into a hybrid
data structure.

In a hybrid data structure for the dynamic predecessor problem, the keys
are partitioned into a collection of buckets, such that for any two buckets B, B′,
B ≠ B′, either it holds that max B < min B′ or vice versa. The keys belonging to
each bucket are stored in a dynamic predecessor data structure for that bucket.
Each bucket is associated with a representative key, and the representative keys
of all buckets are stored in a top-level data structure, which itself is a dynamic
predecessor data structure. The representative key k for a bucket B is such
that k ≤ min B, and for any B′ ≠ B either k < min B′ or k > max B′ (as a
concrete example, the representative of a bucket could just be the smallest key
stored in it, but in general k need not belong to B). From this it follows that
if a key x is in the hybrid data structure, it must belong to the bucket whose
representative is x's predecessor in the top-level data structure. Hence, searching
in a hybrid is simply a predecessor search in the top level followed by a search
in the appropriate bucket.

The size of the buckets is determined by a parameter b which can be viewed
as the ‘ideal’ bucket size. If a bucket gets too large or too small, it is brought
back to a near-ideal size by operations such as: splitting a bucket into two,
joining two buckets or transferring elements from one bucket to another. These
operations may cause the set of representatives to change, but this is done so
that a sequence of n inserts or deletes, starting from an empty data structure,
results in O(n/b) changes to the top level.

As can be seen, even relatively small bucket sizes should greatly reduce the
number of operations on, and hence the number of keys in, the top level.
This suggests that a trie-search tree hybrid should have good all-round
performance, as the update time and memory use of the trie should be greatly
reduced, while the search would be slightly slower due to the use of search trees
in the (small) buckets. Hybrids of tries and trees have been shown to have
reasonable asymptotic performance, e.g. O(√w) time and O(n) space,
where w is the word size of the machine.
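The bucket scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes int keys, uses std::map as the top level and std::set for each bucket, takes the predecessor of x to be the largest stored key not exceeding x, splits a bucket once it holds more than 2b keys (that threshold is our choice), and omits deletion. The class name HybridSketch and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <limits>
#include <map>
#include <set>
#include <utility>

class HybridSketch {
public:
    explicit HybridSketch(std::size_t b) : b_(b) {
        // One empty bucket whose representative precedes every key, so the
        // top-level predecessor search below always finds a bucket.
        top_[std::numeric_limits<int>::min()];
    }

    void insert(int k) {
        Top::iterator it = bucketFor(k);
        it->second.insert(k);
        if (it->second.size() > 2 * b_) split(it);
    }

    // Stores max{y in S | y <= k} in pred and returns true; false if none exists.
    bool locate_pred(int k, int& pred) const {
        auto it = std::prev(top_.upper_bound(k));      // top-level predecessor
        auto pos = it->second.upper_bound(k);          // then search the bucket
        if (pos == it->second.begin()) return false;   // only happens in the first bucket
        pred = *std::prev(pos);
        return true;
    }

    std::size_t buckets() const { return top_.size(); }

private:
    using Top = std::map<int, std::set<int>>;          // representative -> bucket

    Top::iterator bucketFor(int k) { return std::prev(top_.upper_bound(k)); }

    // Moves the upper half of a bucket into a new bucket whose representative
    // is its smallest key: one change to the top level per split.
    void split(Top::iterator it) {
        std::set<int>& low = it->second;
        auto mid = std::next(low.begin(), low.size() / 2);
        std::set<int> high(mid, low.end());
        low.erase(mid, low.end());
        int representative = *high.begin();
        top_.emplace(representative, std::move(high));
    }

    std::size_t b_;
    Top top_;
};
```

Because there is no deletion, every bucket except the first keeps its representative as its smallest key, which is why an unsuccessful bucket search can only mean that k has no predecessor at all.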

Since tries and search trees have each been studied extensively, it may seem 
that there is little need for an investigation of a data structure which combines 
the two. However, the use of these data structures in a hybrid introduces certain 
differences, so existing studies may not suffice: 

(i) Buckets need to support relatively complex operations such as split or join, 
which are not as well studied in the literature. 

(ii) Bucket sizes are much smaller than usual test input sizes, so we are interested 
in performance for ‘small’ n rather than ‘large’ n. 

(iii) In most tests of search trees, the data structure being tested is the main 
entity in memory and it is likely to have use of much of the cache. However, 
in a hybrid data structure, the top-level data structure is more likely to occupy 
cache, and the data structure for each bucket will have to be brought into cache 
each time it is needed. Hence, data structures with poor cache performance when 
used as isolated data structures (e.g. splay trees) may prove more competitive 
in a hybrid data structure. 

Also, a trie in a hybrid data structure may store keys drawn from somewhat
unusual distributions. This is because the distribution of representatives in the
top-level data structure is the result of applying the representative-selection
process to a sequence of inputs drawn from the input distribution. This difference
in distributions can lead to tangible effects in performance, as we note later.
Since the process of choosing representatives can be arbitrarily complex,
the average-case properties of the trie may even be intractable to exact analysis.
Similarly, statistical properties of buckets can also be hard to obtain.

In addition, there is also some interest from the software interface design
aspect, as it is important for data structures to be generic, i.e. to work unchanged
with different data types. Some thought needs to go into making the trie
generic, as it works with the representation of a key rather than its value. Our
current implementation supports keys of type int, unsigned int, float
or double. The hybrid data structure also provides another valuable kind of
genericity, the ability to ‘plug-and-play’ with different top-level data structures,
as well as to use various search tree implementations in the buckets.



2 Implementations 

All implementations have been done in C++ (specifically g++ version 2.8.1).
There are four different main classes in the hybrid data structure at present:



The bucket base class The class Bucket is an empty class used for communication
between the top-level data structure and the collection of buckets.

The hybrid class The external interface to the user is provided by a class
template Hybrid, which takes four arguments: the type of key to be stored, the
type of the information associated with each key, the class which is to be used
for the top-level data structure and the class which is to be used at the bottom
level. The hybrid data structure currently supports three main operations:

void insert(key_type k, item_type i);
    Inserts key k and associated information i, if k is not already present. If k
    is present, then we change the information associated with k to i.

void locate_pred(key_type k, key_type &pred, info_type &pred_info);
    Returns the predecessor of k in pred, and information associated with pred
    in pred_info. If k does not have a predecessor these values are undefined.

bool del(key_type k);
    Deletes k together with its associated information i if k is present. Otherwise,
    the operation is null.

An instance of Hybrid contains an instance each of the top-level class and the
bottom-level class. The top-level class and the bottom-level data structure are
expected to implement certain interfaces. In particular, the top-level class should
be generic, implement dynamic predecessor operations, should hold pairs of the
type (key_type, Bucket *) and be searchable by the key_type component. The
bottom-level class should be a collection of buckets class (described in greater
detail below) and will hold pairs of the form (key_type, info_type), namely the
pairs which are input by the user.

The hybrid data structure implements its externally-available functions by 
calling the appropriate functions in the top and bottom-level data structures, as 
the following example shows: 

void locate_pred(key_type k, key_type &pred, info_type &pred_info) {
    Bucket* B;
    key_type Bkey;

    Top.locate_pred(k, Bkey, B);
    Bottom.locate_pred(B, k, pred, pred_info);
}



As all Hybrid functions are inline, an optimising compiler should eliminate 
the calls to Hybrid and hence avoid the overhead of the indirection. 



The trie class The class template Trie takes two arguments, the type of key
to be stored and the type of information associated with it. Internally, the trie
operates on a fixed-length string of ‘symbols’. A data type such as an integer or
a floating-point number is viewed as such a string by chopping up its (bit-string)
representation into chunks of bits, each chunk being viewed as a symbol.
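The chopping step can be sketched as follows. This is a minimal illustration, not the paper's code: the function name to_symbols and the parameter bits_per_symbol are assumptions, and the choice to let the first symbol absorb any leftover bits is just one way of handling a chunk width that does not divide the key length.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Split a 32-bit representation into symbols, most significant chunk first.
// bits_per_symbol is a hypothetical parameter; 8 bits corresponds to a
// branching factor of 256.
std::vector<std::uint32_t> to_symbols(std::uint32_t rep, unsigned bits_per_symbol) {
    assert(bits_per_symbol >= 1 && bits_per_symbol <= 32);
    const unsigned total_bits = 32;
    // Number of symbols, rounding up when the width does not divide 32.
    const unsigned count = (total_bits + bits_per_symbol - 1) / bits_per_symbol;
    std::vector<std::uint32_t> symbols(count);
    unsigned remaining = total_bits;
    for (unsigned i = 0; i < count; ++i) {
        // The first symbol takes the leftover bits, so all later ones are full width.
        const unsigned width = (i == 0 && total_bits % bits_per_symbol != 0)
                                   ? total_bits % bits_per_symbol
                                   : bits_per_symbol;
        remaining -= width;
        const std::uint64_t mask = (std::uint64_t(1) << width) - 1;
        symbols[i] = static_cast<std::uint32_t>((rep >> remaining) & mask);
    }
    return symbols;
}
```

With 8-bit symbols a 32-bit key becomes a string of four symbols, so a search visits at most four levels of an uncompressed trie.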

The problem with the string representation is that the representation rep(x)
of a value x need not be ordered the same way as x, even for common data types.
For unsigned integers the representation and the value of a key obey the same
ordering, that is, x < y iff rep(x) < rep(y) (the bit-strings are compared
lexicographically). Signed integers and floating-point numbers which obey the
IEEE 754 standard also satisfy the relation x < y iff rep(x) < rep(y), provided
both x and y are non-negative.

However, signed negative integers are represented by their two's complement, so
that x < y iff rep(x) < rep(y) holds even when x and y are both negative, but if
x < 0 and y ≥ 0, then rep(x) > rep(y). Finally, negative floating-point numbers
use the signed-magnitude representation, so that x < y iff rep(x) > rep(y) holds
whenever x < 0. Hence before processing each key in the trie, we must transform
it as follows:

signed integers: complement MSB.

floating point numbers: complement MSB of non-negative numbers, complement
all bits for negative numbers.

The trie class implements (overloaded) functions which convert a key of the 
types int, unsigned int, float and double into a string representation. When 
the template is instantiated, the compiler chooses the appropriate function from 
these, with a compile-time error resulting if the key type is not among the above 
data types. 
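The transformations above can be written as a set of overloaded functions. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the name to_rep is hypothetical, int is assumed to be 32 bits wide, float and double are assumed to use the IEEE 754 binary32 and binary64 formats, and NaN keys are not considered.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(int) == 4 && sizeof(float) == 4 && sizeof(double) == 8,
              "sketch assumes 32-bit int/float and 64-bit double");

// Unsigned integers already sort like their bit-strings.
inline std::uint32_t to_rep(unsigned int k) { return k; }

// Signed integers: complement the most significant bit.
inline std::uint32_t to_rep(int k) {
    return static_cast<std::uint32_t>(k) ^ 0x80000000u;
}

// Floating point: complement the MSB of non-negative numbers,
// complement all bits of negative numbers.
inline std::uint32_t to_rep(float k) {
    std::uint32_t bits;
    std::memcpy(&bits, &k, sizeof bits);  // read the bit pattern without UB
    return (bits & 0x80000000u) ? ~bits : (bits ^ 0x80000000u);
}

inline std::uint64_t to_rep(double k) {
    const std::uint64_t msb = std::uint64_t(1) << 63;
    std::uint64_t bits;
    std::memcpy(&bits, &k, sizeof bits);
    return (bits & msb) ? ~bits : (bits ^ msb);
}
```

After the transformation, comparing two representations as unsigned bit-strings gives the same answer as comparing the original keys, which is what the trie needs for predecessor queries.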

We now summarise the main design choices in implementing a trie. We have
implemented a compressed trie, i.e. where there are no internal nodes of
out-degree 1. Although compression does not offer benefits for the input
distributions and input sizes that we have used for testing, we still use it
because of the better worst-case memory usage of a compressed trie. Consequently,
we suffer from slightly more complex logic while searching.

We have also chosen to divide the key representation into equal-sized chunks,
i.e., each trie internal node has the same branching factor. For specific
distributions, this may not be the best thing to do: for example, it may be better
to have a larger branching factor near the root and a smaller branching factor near
the leaves. By choosing a fixed branching factor we aim to ensure that the
implementation is not tailored to a particular distribution (we repeat here that
the distribution of the representatives may be a non-standard one).

Finally, since our trie is used for predecessor queries, we need to maintain 
a dynamic predecessor data structure at each internal node in the trie. This 
predecessor data structure would typically store the first symbol of the edge 
label of each edge leaving that internal node. We currently use a bit-vector 
based data structure, combined with looking up small tables, for this purpose. 
For ‘reasonable’ values of the branching factor like 256 or 2048 this data structure 
seems quite fast. Finally, we maintain the leaf nodes in a doubly linked list (useful 
for predecessor queries). 
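A per-node predecessor structure of the kind described, a bit-vector combined with a small lookup table, might look like the sketch below for a branching factor of 256. The class name NodeBits, the byte-wise layout and the 256-entry table are assumptions made for illustration; the paper does not give the details of its own structure.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Table lookup: index of the highest set bit of a byte, or -1 for zero.
inline const std::array<int, 256>& top_bit_table() {
    static const std::array<int, 256> table = []() -> std::array<int, 256> {
        std::array<int, 256> t{};
        t[0] = -1;
        for (int v = 1; v < 256; ++v) t[v] = t[v >> 1] + 1;
        return t;
    }();
    return table;
}

// One presence bit per possible first symbol of an outgoing edge.
class NodeBits {
public:
    NodeBits() : bytes_() {}

    void set(unsigned s) {
        assert(s < 256);
        bytes_[s >> 3] |= static_cast<std::uint8_t>(1u << (s & 7));
    }
    void clear(unsigned s) {
        assert(s < 256);
        bytes_[s >> 3] &= static_cast<std::uint8_t>(~(1u << (s & 7)));
    }

    // Largest present symbol <= s, or -1 if there is none.
    int pred(unsigned s) const {
        assert(s < 256);
        int byte = static_cast<int>(s >> 3);
        // In the first byte, ignore the bits above position s.
        unsigned masked = bytes_[byte] & ((2u << (s & 7)) - 1);
        while (true) {
            if (masked != 0) return byte * 8 + top_bit_table()[masked];
            if (byte == 0) return -1;
            masked = bytes_[--byte];
        }
    }

private:
    std::array<std::uint8_t, 32> bytes_;  // 256 bits
};
```

A production version would probably scan whole machine words rather than bytes; the byte-wise scan keeps the table small and the logic easy to follow.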



Footnote (on the equal-sized chunks): The sizes may vary slightly due to rounding.

The collection of buckets class The collection of buckets class template
takes three arguments: the key type, the information type and the class that
implements each bucket. It takes care of searching in buckets and ensures that
the bucket sizes satisfy some invariants. The class template which embodies the
notion of ‘bucket’ must be derived from the empty Bucket class. This absolves
the top-level data structure from knowing the precise implementation of the
bottom level; the top-level can refer to pointers to the ‘bucket’ simply using the
type Bucket *.

The collection of buckets implements dynamic predecessor operations, but 
with an additional parameter, for example: 

void locate_pred(Bucket* b, key_type k, key_type &pred,
                 info_type &pred_info);

returns the predecessor of k in pred, and the information associated with pred
in pred_info. If the predecessor is not defined these values are undefined as well.
The search should be performed in the bucket b. Since an insertion may cause a
bucket to split, the signature of an insert into a collection of buckets is as follows:

void insert(Bucket* b, key_type k,
            info_type i, key_type &splitkey, Bucket* &newbucket);

The new representative and the new bucket pointer are returned through
splitkey and newbucket respectively (if an insertion caused a bucket to split).
Since a deletion may cause a bucket to join with a neighbour, the signature of a 
delete from a collection of buckets has extra parameters as follows: 

bool del(Bucket* b, key_type k, key_type & top_delete) ; 

If a delete from the bucket b causes it to ‘join’ with an adjacent bucket then 
one representative key must be deleted from the top level: it is returned in the 
parameter top_delete. 

We have identified two ways of implementing a collection of buckets. The first
is randomised, where each key is chosen to be a representative independently with
some probability p. A bucket consists of the keys between a representative and
the next larger representative. Clearly, the average bucket size is b = 1/p. One
way of implementing this is by storing all the keys in a skip list, and choosing a
level l so that the probability that a given key is at level l is approximately 1/b. A
representative then is simply an element at level l or greater. A newly-inserted
key is a representative with probability about 1/b and a deletion causes a change
in the top-level data structure only if the key deleted is a representative, which
happens only with probability about 1/b. Hence a sequence of n updates leads to
an expected O(n/b) changes to the top level, as desired. We have implemented
this idea and call it SkipBC below. Due to the parameter choices in the
underlying skip list implementation, we can only support average bucket sizes which
are powers of 4.
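The choice of the representative level can be sketched as follows. The sketch assumes that a key reaches each skip-list level with probability 1/4 of reaching the level below, with level 0 holding every key; this is one reading of the remark that bucket sizes must be powers of 4, not a documented property of the LEDA skip list. All function names are hypothetical.

```cpp
#include <cassert>
#include <random>

// Level l at which a key is a representative with probability 1/b,
// assuming promotion probability 1/4 per level. Requires b to be a power of 4.
inline int representative_level(unsigned long b) {
    assert(b >= 1);
    int level = 0;
    while (b > 1) {
        assert(b % 4 == 0);  // only powers of 4 are supported
        b /= 4;
        ++level;
    }
    return level;
}

// Draw the level of a newly inserted key, capped at max_level.
template <class Rng>
int random_level(Rng& rng, int max_level) {
    std::bernoulli_distribution promote(0.25);
    int level = 0;
    while (level < max_level && promote(rng)) ++level;
    return level;
}

// A key is a representative if it reaches the representative level.
inline bool is_representative(int key_level, int rep_level) {
    return key_level >= rep_level;
}
```

Under this assumption an average bucket size of 64 corresponds to level 3, and only keys that reach that level ever cause a change in the top-level structure.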

Footnote (on the skip-list level l): Actually, we can limit the skip list to
level l and dispense with levels l + 1 onwards.

The second is more traditional, where the buckets are dynamic predecessor
data structures (usually search trees) that allow efficient splits and joins. We
have created a generic implementation of such a bucket collection into which
implementations of various search trees can be incorporated. The bucket sizes
are maintained in one of two ways (b is the maximum bucket size):



— an eager method, where buckets are always required to have sizes in the
range ⌊b/2⌋ to b - 1. An insertion can cause a bucket to have more than b - 1
elements: if so, this bucket is split into two buckets of equal size and a new
representative is added. A deletion can cause a bucket to have fewer than
⌊b/2⌋ elements: if so, it is joined with an adjacent bucket and a representative
is deleted. Furthermore, if the resulting bucket has b or more elements, it is
split into two of equal size and a new representative is added.

— a lazy method, where the invariants are that each bucket should have between
1 and b - 1 elements, and that for every pair of adjacent buckets B and B',
|B| + |B'| ≥ ⌊b/2⌋. An insertion can cause a bucket to have more than b - 1
elements: if so, it is split into two equal buckets and a new representative is
added. A deletion can empty a bucket: if so, the empty bucket is removed and
its representative is deleted. A deletion can also cause two adjacent buckets
to have fewer than ⌊b/2⌋ elements in total: if so, they are joined and a
representative is deleted. It is easy to verify that the invariants are restored.

It should be noted that to maintain good worst-case performance, the eager 
method should have a ratio of more than 2 between the minimum and maximum 
bucket sizes. Also note that the lazy method has a cleaner interface, as a deletion 
can only cause one change in the top level. There are currently three specific 
instantiations of this generic bucket collection: 

— Using a traditional implementation of splay trees (with bottom-up splaying).
We refer to this as SplayBC below. Searches, insertions and deletions which
do not cause a split are handled in O(log b) amortised time. Splits and joins
take O(log b) amortised time as well.

— Using a traditional implementation of red-black trees. We refer to this as
RedBlackBC below. Searches, insertions and deletions which do not cause a
split or join are handled in O(log b) time. Joins take O(log b) time but splits
take O((log b)^2) time.

— Using sorted arrays and binary search. Searches take O(log b) time but
insertions take O(b) time. Joins and splits also take O(b) time. All constants
are quite small. We refer to this as ArrayBC below.

A few points are worth noting: as the number of splits and joins in a sequence
of n operations on the hybrid should be O(n/b), the cost of splits and joins is
essentially negligible in all the above implementations. Also, the ArrayBC may
not be able to support some operations that the other implementations can, such
as deleting a key in O(1) time given a pointer to it.
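A sorted-array bucket with the split rule described above might be sketched as follows. This is not the ArrayBC code: the names ArrayBucket and insert_with_split are hypothetical, the predecessor is assumed to mean the largest key less than or equal to the query, and the new representative is assumed to be the smallest key of the newly created bucket.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One bucket stored as an array of (key, info) pairs sorted by key.
template <class Key, class Info>
struct ArrayBucket {
    typedef std::pair<Key, Info> Entry;
    std::vector<Entry> items;

    // Insert k, or overwrite its info if present. O(b): later entries shift.
    void insert(const Key& k, const Info& i) {
        auto it = std::lower_bound(items.begin(), items.end(), k,
            [](const Entry& e, const Key& key) { return e.first < key; });
        if (it != items.end() && it->first == k) it->second = i;
        else items.insert(it, Entry(k, i));
    }

    // Largest key <= k (assumed meaning of "predecessor"); false if none.
    bool locate_pred(const Key& k, Key& pred, Info& pred_info) const {
        auto it = std::upper_bound(items.begin(), items.end(), k,
            [](const Key& key, const Entry& e) { return key < e.first; });
        if (it == items.begin()) return false;
        --it;
        pred = it->first;
        pred_info = it->second;
        return true;
    }
};

// Insert into a bucket with maximum size b. If the bucket then holds more
// than b - 1 entries, move its upper half into newbucket, report the new
// representative in splitkey and return true.
template <class Key, class Info>
bool insert_with_split(ArrayBucket<Key, Info>& bucket, const Key& k, const Info& i,
                       std::size_t b, Key& splitkey, ArrayBucket<Key, Info>& newbucket) {
    assert(b >= 2);
    bucket.insert(k, i);
    if (bucket.items.size() <= b - 1) return false;
    const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(bucket.items.size() / 2);
    newbucket.items.assign(bucket.items.begin() + half, bucket.items.end());
    bucket.items.erase(bucket.items.begin() + half, bucket.items.end());
    splitkey = newbucket.items.front().first;
    return true;
}
```

The caller would insert splitkey, paired with a pointer to the new bucket, into the top-level structure whenever the function returns true.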

Footnote (on the red-black split time): This is because we recompute black
heights O(log b) times during a split.

3 Experiments

We have performed a series of experiments evaluating the effect of the different 
parameters on performance, which are summarised below. All experiments were 
performed on a Sun Ultra-II machine with 2 x 300 MHz processors and 512MB 
of memory. It has a 16KB L1 data cache and 512KB L2 data cache; both are
direct-mapped. The L1 cache is write-through and the L2 cache is write-back.
As mentioned above the compiler used is g++ version 2.8.1. The LEDA speeds
are based on LEDA 3.7 and for the SkipBC we also make use of the LEDA 3.7 
memory manager. 



Data types We considered 32-bit integer (int) and 64-bit floating-point keys
(double). The floating-point keys were generated uniformly using the function
drand48(). The integer keys were chosen from one of two distributions: uniform
or biased-bit. In the uniform distribution, we choose an integer from the full
range of integers (-2^31 to 2^31 - 1 in our case) using the function mrand48().
The biased-bit distribution generates integers where each bit (including the sign
bit) is chosen to be 0 independently with probability p (we use p = 0.25 here).
The searches were overwhelmingly for keys which were not in the data structure
(unsuccessful searches). This is important because some data structures such as
the trie perform successful searches very quickly.
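A biased-bit key generator along these lines can be sketched as below. The sketch uses the C++ standard random-number facilities instead of the drand48 family, which is a substitution made for portability, and the function name biased_bit_key is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Draw a 32-bit pattern in which every bit, including the sign bit, is 0
// independently with probability p_zero.
template <class Rng>
std::uint32_t biased_bit_key(Rng& rng, double p_zero) {
    assert(p_zero >= 0.0 && p_zero <= 1.0);
    std::bernoulli_distribution is_one(1.0 - p_zero);
    std::uint32_t bits = 0;
    for (int i = 0; i < 32; ++i) {
        bits = (bits << 1) | (is_one(rng) ? 1u : 0u);
    }
    return bits;
}
```

With p_zero = 0.25 most bits are 1, so the generated keys cluster in a small part of the key space, which stresses the trie differently from uniform input.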

The searches were generated in two ways. In the first (uniform access)
we generated a series of random keys from the same distribution as the keys
already in the data structure, and used these keys as the search keys. In the
second (pseudo-zipf access), we generate unsuccessful searches that induce an
approximation to the zipf distribution on the keys in the data structure. This is
done as follows: we generate n keys y_1, ..., y_n from the same distribution as the
n keys currently in the data structure, and generate a sequence of searches,
where each search takes an argument drawn from y_1, ..., y_n according to the zipf
distribution. In both cases, as the size of the data structure is small compared
to the range of possible inputs, most searches will be unsuccessful.
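Drawing search arguments according to the zipf distribution can be sketched with a cumulative table and binary search. The sketch assumes the usual definition, in which the i-th of n keys is chosen with probability 1/(i H_n) and H_n is the n-th harmonic number; the function names are hypothetical and the standard library generator stands in for whatever the experiments used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Cumulative zipf probabilities: cdf[i-1] = (1/1 + ... + 1/i) / H_n.
inline std::vector<double> zipf_cdf(std::size_t n) {
    assert(n >= 1);
    std::vector<double> cdf(n);
    double sum = 0.0;
    for (std::size_t i = 1; i <= n; ++i) {
        sum += 1.0 / static_cast<double>(i);
        cdf[i - 1] = sum;
    }
    for (double& c : cdf) c /= sum;  // normalise by H_n
    return cdf;
}

// Draw an index in [0, n); index i-1 has probability 1/(i * H_n).
template <class Rng>
std::size_t zipf_rank(const std::vector<double>& cdf, Rng& rng) {
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    auto it = std::lower_bound(cdf.begin(), cdf.end(), unit(rng));
    if (it == cdf.end()) --it;  // guard against rounding in the last entry
    return static_cast<std::size_t>(it - cdf.begin());
}
```

A search sequence is then obtained by repeatedly drawing an index and looking up the corresponding generated key y_i.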



Data structures We tested the following data structures: Hybrid(ARRAY),
Hybrid(RB), Hybrid(SKIP) and Hybrid(SPLAY), which are hybrids of the
trie with ArrayBC, RedBlackBC, SkipBC and SplayBC respectively; Leda(RB),
which is the LEDA sorted sequence with red-black tree implementation
parameter; and Trie, which is the raw trie. After preliminary experiments (not
reported here) the branching factor of the trie was fixed at 256. The bucket size
was fixed at 48; this choice ensures that the size of the top-level trie is no more
than twice the size of a search tree with n nodes, if a bucket on average is
half-full. For SkipBC, the average bucket size is 64, which is the closest
approximation to 48. Unless mentioned otherwise, Hybrid(ARRAY), Hybrid(RB) and
Hybrid(SPLAY) should be considered as implementing lazy bucket rebalancing.

Footnote (on the zipf distribution): Recall that the zipf distribution states
that if keys x_1, ..., x_n are present in the data structure, then x_i should be
accessed with probability $1/(i H_n)$, where $H_n = \sum_{j=1}^{n} 1/j$.



Test setup The main series of tests aimed to determine the performance of 
the various hybrid data structures under the standard metrics of run-time and 
memory usage. Starting from an empty data structure, the following operations 
were performed in turn: 

(A) n inserts, 

(B) a mixed sequence of 2n inserts/deletes, where each operation is an insert or 
delete with probability 0.5, 

(C) n predecessor queries, generated as described above. 

The time for each of these was measured, as was the memory required by the 
top-level data structure, plus any ‘overheads’ incurred by the bucket collection, 
such as space to store the sizes of buckets etc. The amount of memory required 
was measured both after step (A) and step (B). Each test was performed twice 
using the same random number seed, taking the minimum as the running time 
for this seed; the running times reported are the average over five seeds. We 
tested ints under three scenarios: uniform input and uniform access, biased 
input and uniform access and uniform input and pseudo-zipf access. We tested 
doubles only with uniform input and uniform access. 
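The way the reported running time is derived from the raw measurements can be written down directly. In this sketch the function name reported_time and the data layout (one inner vector of measured times per seed) are assumptions; the rule itself, minimum over repeated runs and then mean over seeds, is the one described above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// runs_per_seed[s] holds the times measured for seed s (two runs per seed
// in the setup described above). The result is the mean over seeds of the
// per-seed minimum.
inline double reported_time(const std::vector<std::vector<double>>& runs_per_seed) {
    assert(!runs_per_seed.empty());
    double total = 0.0;
    for (const auto& runs : runs_per_seed) {
        assert(!runs.empty());
        total += *std::min_element(runs.begin(), runs.end());
    }
    return total / static_cast<double>(runs_per_seed.size());
}
```

Taking the minimum over identical runs filters out interference from other activity on the machine, while averaging over seeds smooths out the effect of a particular random input.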

We have performed preliminary experiments which evaluate the effectiveness 
of various bucket sizes and branching factors; the results of these are not reported 
here but are reflected in the choices of parameters. We have also compared the 
effectiveness of the lazy versus the eager strategies for bucket rebalancing. Since 
the lazy and eager strategies are the same if we are doing only insertions, we 
perform the test primarily on a mixed sequence of inserts and deletes, i.e., in 
(A) to (C) above, we replace the 2n in (B) with 8n, the idea being to reach a 
‘steady state’ which was more representative of the mixed sequence. 

4 Experimental Results and Conclusions 

Main experiments Figs. 1 and 2 summarise the performance of the various data
structures considered on uniform data and uniform access. The running times
are per operation, and are given in microseconds. The update time reported in
Figs. 1 and 2 is the mean of the update times from (A) and (B). Some conclusions
may be drawn:

1. Hybrid(ARRAY) is a clear winner, as it nearly matches the performance of
the raw trie for searches, but is far superior for updates.

2. The hybrid running times seem to grow roughly logarithmically, but not
very smoothly. The trie search times grow logarithmically and smoothly for
ints but not as smoothly for doubles. The trie update times seem to grow
irregularly, but again there is a difference between ints and doubles.

3. All hybrids outperform Leda(RB) for doubles.

4. Hybrid(SPLAY) outperforms Hybrid(SKIP). The SkipBC is obtained by
a minor modification of the LEDA skip list code. By itself, the LEDA skip
list code is 1.5 to 2 times faster than the splay tree code used by SplayBC.

In point (2) above, the difference between doubles and ints is due to the
fact that uniformly distributed doubles do not induce a uniform distribution on
the bit-strings used to represent them. In fact, following the IEEE 754 standard,
the eleven most significant bits of a random double in the range [0.0, 1.0)
have the value 00111111111 with probability 1/2 (these are the values in
[0.5, 1.0), which all share the biased exponent 1022).
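The skew in the leading bits can be checked with a few lines of code. The sketch below assumes that double uses the IEEE 754 binary64 format (one sign bit followed by an 11-bit biased exponent); the function name top11_bits is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Return the eleven most significant bits of the binary64 pattern of x:
// the sign bit followed by the top ten bits of the exponent field.
inline std::uint32_t top11_bits(double x) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // copy the pattern without UB
    return static_cast<std::uint32_t>(bits >> 53);
}
```

Every value in [0.5, 1.0) yields the same result, 0x1FF (binary 00111111111), so half of a uniform sample from [0.0, 1.0) follows a single path through the top of a trie built on the raw bit pattern.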

The non-linear variation of update time with input size appears to be due 
to the fact that for random tries of certain size ranges, updates cause internal 
nodes to be added/deleted infrequently. Since adding an internal node in a trie is 
expensive, insertions/deletions to a random trie in this size range are quite cheap. 
It also appears to be plausible that two B-ary tries, one with k random keys and
the other with kB, should have similar (relative) update costs. Since there is a 
factor of 256 between our smallest and largest input sizes, we expect this pattern 
to repeat. A more detailed explanation of this phenomenon is deferred to the 
full version of this paper. However, we note here that: 

— The effect is far less pronounced in the hybrid data structures than in the 
raw trie (this is because the vast majority of updates do not change the trie); 

— This affects the memory usage of the hybrid. The table below shows the
memory used by the trie, plus any ‘overheads’ incurred by the bucket
collection, for Hybrid(ARRAY) and Hybrid(SKIP), all expressed as bytes per
key. It is interesting to note that the variation in the Hybrid(SKIP) is of a
similar magnitude to that of the Trie. This is to be expected, as the
top-level of Hybrid(SKIP) stores keys drawn from the same distribution as the
trie. However, Hybrid(ARRAY) stores more precisely spaced values from the
current set and shows sharper variations from one value of n to the next.

n                 16K    32K    64K   128K   256K   512K     1M     2M     4M
Hybrid(ARRAY)    17.5   10.4    6.0    3.8    2.7    2.2    2.0    7.4   17.1
Hybrid(SKIP)      5.7    6.1    4.8    3.2    2.3    2.3    2.9    4.0    5.4
Trie            160.5  233.4  321.1  357.3  284.1  179.0  126.3  121.9     NA



It is not clear how to explain point (4) . The difference in average bucket sizes 
(about 24 for the splay tree vs. about 64 for the skip list) is a factor, but as 
searches are logarithmic this should not matter much. We believe that a splay 
tree is slow for large input sizes largely because of its poor use of the memory 
hierarchy; a small splay tree may not suffer to the same extent. 

Figure 3 shows that Hybrid(ARRAY) maintains its performance advantage
for biased-bit keys with uniform access as well as for uniform keys with
pseudo-zipf access. In Figure 3(b) it can be seen that Hybrid(SPLAY) is competitive
with Hybrid(RB); this is probably because the splay trees adjust to the
non-uniform access patterns. Since the zipf distribution is widely regarded as a good
model for real-world access frequencies, Hybrid(SPLAY) may perform well in
the real world.

Fig. 3. Non-uniform data: (a) biased-bit integers with uniform access (b) uniform
integers with pseudo-zipf access. The trie exceeded available memory at n = 2^^
in (a).

Finally, we compare the lazy and eager methods of bucket maintenance,
using the test described at the end of the previous section. The table below
summarises the experimental results from this test. It shows the average bucket 
fullness at the end of a sequence of n inserts, as well as after a mixed sequence 
of 8n inserts and deletes. For the lazy method, for bucket sizes 48 and 64, it 
shows the search times and memory used by the top-level and auxiliary data. 
For the eager method, the bucket fullness is the same as the lazy method after 
n inserts, and is not shown. We do show, however, the bucket fullness after a 
mixed sequence of 8n inserts and deletes, as well as the search time and memory 
used for a bucket size of 48.

                Lazy fullness       Lazy, b = 48      Lazy, b = 64            Eager, b = 48
      n       n inserts   mixed    memory  search    memory  search    mixed    memory  search
  16384         0.66      0.52     19.08    1.59     15.35    1.46     0.65     16.80    1.34
  32768         0.66      0.52     10.60    1.75     10.08    1.71     0.65     10.19    1.83
  65536         0.66      0.52      6.27    2.20      5.73    2.08     0.65      5.84    1.98
 131072         0.66      0.52      4.11    2.18      3.58    2.18     0.65      3.68    2.14
 262144         0.66      0.52      3.02    2.53      2.49    2.56     0.65      2.59    2.47
 524288         0.66      0.52      2.49    2.62      1.96    2.62     0.65      2.06     ...

Remaining entries (rows incomplete): 0.52, 2.45, 2.91.



We note that a sequence of inserts leads to buckets being about 2/3 full. The 
lazy method only manages a fullness of about 1/2 with the mixed sequence, while
the eager method has a fullness of about 2/3 again. We do not show this here,
but the lazy method also processes the mixed update sequence more rapidly than
the eager method (by virtue of performing fewer operations on the top level).
However, it is more wasteful of memory: in the ArrayBC it would require about
33% more memory in the bucket collection. One could reduce the top-level and 
auxiliary space requirements of the lazy method by choosing a larger bucket 
size. In the table, we demonstrate that the lazy method with a bucket size of 
64 essentially reduces top-level and auxiliary memory, as well as search time, to 
the level of the eager method. Hence, in a SplayBC or RedBlackBC one would 
prefer the lazy method. 



5 Conclusions and Future Work 

We have demonstrated that hybrid data structures which combine tries with 
buckets based on arrays can deliver excellent performance while maintaining 
a degree of genericity. A lot more work needs to be done before one can be 
prescriptive regarding the best choices of the various parameters. Here we have 
shown that parameter choices set using worst-case assumptions about the input 
distributions give surprisingly good performance. We believe that this, rather 
than tuning based on various distributions, may be the best approach towards 
parameter selection for very general-purpose data structures.

Abstract. In recent years, many software libraries for in-core
computation have been developed. Most internal memory algorithms
perform very badly when used in an external memory setting. We
introduce LEDA-SM, which extends the LEDA library towards secondary
memory computation. LEDA-SM uses I/O-efficient algorithms and data
structures that do not suffer from the so-called I/O bottleneck. LEDA is
used for in-core computation. We explain the design of LEDA-SM and
report on performance results.



1 Introduction 



Current computers have large main memories, but some applications need to
manipulate data sets that are too large to fit into main memory. Very large data
sets arise, for example, in geographic information systems, text indexing,
WWW search, and scientific computing. In these applications, secondary
memory (mostly disks) provides the required workspace. It has two features that
distinguish it from main memory:

— Access to secondary memory is slow. An access to a data item in external
memory takes much longer than an access to the same item in main memory;
the relative speed of a fast internal cache and a slow external memory is close
to one million.

— Secondary memory rewards locality of reference. Main memory and
secondary memory exchange data in blocks. The transfer of a block between
main memory and secondary memory is called an I/O operation (I/O for short).

Standard data structures and algorithms are not designed for locality of reference
and hence frequently suffer an intolerable slowdown when used in conjunction
with external memory. They simply use the virtual memory (provided by the
operating system) and address this memory as if they operated in internal
memory, thus performing huge numbers of I/Os. In recent years the algorithms
community has addressed this issue and has developed I/O-efficient algorithms
for many data structure, graph-theoretic, and geometric problems.
Implementations and experimental work are lagging behind.

External memory algorithms move data in the memory hierarchy and process
data in main memory. A platform for external memory computation therefore
has to address two issues: movement of data and co-operation with internal
memory algorithms.

We propose LEDA-SM (LEDA secondary memory) as a platform for external
memory computation. It extends LEDA to secondary memory computation
and is therefore directly connected to an efficient internal-memory library of
data structures and algorithms. LEDA-SM is portable, easy to use and efficient.
It consists of:



— a kernel that gives an abstract view of secondary memory and provides a
convenient infrastructure for implementing external memory algorithms and
data structures. We view secondary memory as a collection of disks and each
disk as a collection of blocks. There are currently four implementations of
the kernel, namely by standard I/O (stdio), system call I/O (syscall),
memory mapped I/O (mmapio) and memory disks (memory). All four
implementations are portable across Unix-based platforms (including Linux), and
stdio and syscall are also available for Microsoft operating systems.

— a collection of external memory data structures. An external memory data
structure offers an interface that is akin to the interface of the corresponding
internal memory data structures (of LEDA), uses only a small amount of
internal memory, and offers efficient access to secondary memory. For example,
an external stack offers the stack operations push and pop, requires only
slightly more than two blocks of internal memory, and needs only O(1/B)
amortised I/O operations per push and pop, where B is the number of data items
that fit into a block.

— algorithms operating on these data structures. This includes basic algorithms 
like sorting as well as matrix multiplication, text indexing and simple graph 
algorithms. 

— a precise and readable specification for all data structures and algorithms. 
The specifications are short and abstract so as to hide all details of the 
implementation. 
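The external stack mentioned in the list above illustrates why two blocks of internal memory suffice. The sketch below is a toy model, not LEDA-SM code: the class name ToyExtStack is hypothetical, an in-memory vector of blocks stands in for the disk, and a counter records how many block transfers a real implementation would perform.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy external stack: disk_ stands in for secondary memory and ios_ counts
// block transfers. At most 2B items are held in the in-memory buffer.
template <class T>
class ToyExtStack {
public:
    explicit ToyExtStack(std::size_t block_size) : B_(block_size), ios_(0) {
        assert(B_ >= 1);
    }

    void push(const T& x) {
        if (buffer_.size() == 2 * B_) {
            // Buffer full: write the B oldest buffered items out as one block.
            const std::ptrdiff_t b = static_cast<std::ptrdiff_t>(B_);
            disk_.emplace_back(buffer_.begin(), buffer_.begin() + b);
            buffer_.erase(buffer_.begin(), buffer_.begin() + b);
            ++ios_;
        }
        buffer_.push_back(x);
    }

    T pop() {
        if (buffer_.empty()) {
            assert(!disk_.empty());  // popping an empty stack is a caller error
            buffer_ = std::move(disk_.back());  // read the most recent block
            disk_.pop_back();
            ++ios_;
        }
        T x = buffer_.back();
        buffer_.pop_back();
        return x;
    }

    bool empty() const { return buffer_.empty() && disk_.empty(); }
    std::size_t ios() const { return ios_; }

private:
    std::size_t B_;
    std::size_t ios_;
    std::vector<T> buffer_;             // in-memory part, at most 2B items
    std::vector<std::vector<T>> disk_;  // blocks of exactly B items each
};
```

After every block transfer the buffer holds exactly B items, so at least B further operations are needed before the next transfer; this gives the O(1/B) amortised bound per push and pop.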



The external memory data structures and algorithms of LEDA-SM (items (2)
and (3)) are based on the kernel; however, their use requires little knowledge
of the kernel. LEDA-SM supports fast prototyping of secondary memory
algorithms and therefore can be used to experimentally analyze new data structures
and algorithms in a secondary memory setting.



The database community has a long tradition of dealing with external
memory. Efficient index structures, e.g., B-trees and extendible hashing, have
been designed and highly optimized implementations are available. “General
purpose” external memory computation has never been a major concern for the
database community.

In the algorithms community implementation work is still in its infancy.
There are implementations of particular data structures and there is
TPIE, the transparent I/O environment. The former work aimed at
investigating the relative merits of different data structures, but not at external memory
computation at large. TPIE provides some external programming paradigms like
scanning, sorting and merging. It does not offer external memory data
structures and it has no support for internal memory computation. Both features
were missed by users of TPIE; it is planned to add both features to TPIE
(L. Arge, personal communication, February 99). A different approach is
ViC*, the Virtual Memory C* compiler. The ViC* system consists of a
compiler and a run-time system. The compiler translates C* programs with shapes
declared outofcore, which describe parallel data stored on disk. The compiler
output is a program in standard C with I/O and library calls added to efficiently
access out-of-core parallel data. At the moment, most of the work is focussed on
out-of-core fast Fourier transform, BMMC permutations and sorting.

Footnote (on the availability of LEDA-SM): LEDA-SM can be downloaded from
www.mpi-sb.mpg.de/~crauser/leda-sm.html

This paper is structured as follows. In Section 2 we explain the software
architecture of LEDA-SM and discuss the main layers of LEDA-SM (kernel,
data structures and algorithms). We describe the kernel and give an overview
of the currently available data structures and algorithms. In Section 3 we give
two examples to show (i) how the kernel is used for implementing an external
data structure and (ii) the ease of use of LEDA-SM and its interaction with LEDA.
Some experimental results are given in Section 4. We close with a discussion of
future developments.

2 LEDA-SM 

LEDA-SM is a C++ class library that uses LEDA for internal memory computa- 
tions. LEDA-SM is designed in a modular way and consists of 4 layers: 



Layer              Major Classes
algorithms         sorting, graph algorithms, ...
data structures    ext_stack, ext_array, ...
abstract kernel    block<E>, B_ID, U_ID
concrete kernel    ext_memory_manager, ext_disk, ext_free_list, name_server



We use the term application layer for the upper two layers and kernel layer for
the lower two layers.

The kernel layer of LEDA-SM manages secondary memory and movement
of data between secondary memory and main memory. It is divided into the
abstract and the concrete kernel. The abstract kernel consists of 3 C++ classes
that give an abstract view of secondary memory. Two classes model disk block
locations and users of disk blocks while the third class is a container class that is
able to transfer elements of type E to and from secondary memory in a simple
way. This class provides a typed view of data stored in secondary memory; data
on disk is always untyped (type void*).

The concrete kernel is responsible for performing I/Os and managing disk 
space in secondary memory and consists of 4 C++ classes. LEDA-SM provides 
several realizations of the concrete kernel; the user can choose one of them at run- 
time. The concrete kernel defines functions for performing I/Os and managing 
disk blocks, e.g. read/write a block, read/write k consecutive blocks, allocate/free 
a disk block etc. These functions are used by the abstract kernel or directly by 
data structures and algorithms. 

The application layer consists of data structures and algorithms. LEDA is 
used to implement the in-core part of the applications and the kernel of LEDA- 
SM is used to perform the I/Os. The physical I/O calls are completely hidden 
in the concrete kernel, the applications use the container class of the abstract 
kernel to transport data to and from secondary memory. This makes the data 
structures simple, easy to use and still efficient. 

We now explain the kernel layer in more detail. Examples for the application 
layer are given in Section 2.2. 



2.1 The Kernel 

The abstract kernel models secondary memory as files of the underlying file 
system. Secondary memory consists of NUM_OF_DISKS logical disks (files 
of the file system); max_blocks[d] blocks can reside on the d-th disk, 0 ≤ 
d < NUM_OF_DISKS. The blocks on any disk are numbered consecutively 
starting at zero. A block identifier is a pair (d, num) of integers. A block identi- 
fier is called valid if 0 ≤ d < NUM_OF_DISKS and 0 ≤ num < max_blocks[d], and it 
is called active if it is valid and the block denoted by it was written to. The class 
B_ID realizes block identifiers. Observe that block identifiers refer to physical 
objects, namely, to regions of storage on disk. In the remainder of this section 
there is the need to distinguish between blocks as physical objects (= a region 
of storage on disk) and blocks as logical objects (= a bit pattern of a particular 
size). We will use the word disk block for the physical object and reserve the 
word block for the logical object. The disk blocks are managed by the external 
memory manager (class ext_memory_manager). There is only one instance of 
this class. The external memory manager can be asked to allocate disk blocks, 
to free disk blocks, and to transfer blocks between main memory and external 
memory. The allocation of a disk block is either on a disk chosen by the user or 
on a disk chosen by the system (if no disk is specified in the allocation request). 
The return value of an allocation request is a block identifier which can later 
be used in read- and write-operations. An allocated disk block is always owned 
by a particular user. Only the owner of a disk block can write the disk block. 
A user is identified by a user identifier (an integer) of class U_ID, and 
user identifiers are managed by class nameserver. We use user identifiers for 
memory protection. Every instance of a data structure is a different user of the 
kernel and hence data structures are protected against one another. There are 
different ways in which these user checks can be performed: a conservative approach 
checks user identifiers during both read and write accesses, so that not even false 
data is read, while the standard approach checks user identifiers only during write 
accesses. 
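
The validity and ownership rules described above can be sketched in a few lines of C++. This is only an illustrative model under our own assumptions: the names BlockId, UserId, DiskLayout, may_read and may_write are hypothetical and are not the interface of B_ID, U_ID or ext_memory_manager.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for B_ID and U_ID; not the LEDA-SM classes.
struct BlockId { int d; long num; };
using UserId = int;
constexpr UserId NO_USER = -1;

class DiskLayout {
    std::vector<long> max_blocks_;                    // max_blocks[d] per disk d
    std::map<std::pair<int, long>, UserId> owner_;    // owner of each allocated block
public:
    explicit DiskLayout(std::vector<long> max_blocks)
        : max_blocks_(std::move(max_blocks)) {}

    // valid: 0 <= d < NUM_OF_DISKS and 0 <= num < max_blocks[d]
    bool valid(BlockId b) const {
        return b.d >= 0 && static_cast<std::size_t>(b.d) < max_blocks_.size()
            && b.num >= 0 && b.num < max_blocks_[static_cast<std::size_t>(b.d)];
    }

    // Record that user u owns disk block b (what an allocation would do).
    void allocate(BlockId b, UserId u) {
        assert(valid(b));
        owner_[std::make_pair(b.d, b.num)] = u;
    }

    UserId owner(BlockId b) const {
        auto it = owner_.find(std::make_pair(b.d, b.num));
        return it == owner_.end() ? NO_USER : it->second;
    }

    // Both policies check the owner on writes.
    bool may_write(BlockId b, UserId u) const { return valid(b) && owner(b) == u; }

    // The conservative policy also checks reads; the standard one does not.
    bool may_read(BlockId b, UserId u, bool conservative) const {
        if (!valid(b)) return false;
        return conservative ? owner(b) == u : true;
    }
};
```

The sketch keeps the two policies side by side so that the cost difference is visible: the standard policy saves one ownership lookup per read.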

The parameterized type block<E> is used to store logical blocks in internal 
memory. An instance B of type block<E> can store one logical block and gives 
a typed view of logical blocks. A logical block is viewed as two arrays: an array 
of links to disk blocks and an array of variables of type E. A link is of type 
block identifier; the number num_of_bids of links is fixed when a block is created. 
The number of variables of type E is denoted by blk_sz and is equal to 

    (BLK_SZ * sizeof(GenPtr) - num_of_bids * sizeof(B_ID)) / sizeof(E) 

where GenPtr is the generic pointer type of the machine (usually void*) and 
BLK_SZ is the physical disk block size in bytes. Both arrays are indexed 
starting at zero. 
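
The formula for blk_sz can be checked with a small helper. The function below is a hypothetical illustration, not part of LEDA-SM; its parameter block_bytes stands for the product BLK_SZ * sizeof(GenPtr), and the concrete sizes used in the note after the code are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements of size elem_size that fit into one logical block
// after num_of_bids links of size bid_size have been reserved.
inline std::size_t elements_per_block(std::size_t block_bytes,
                                      std::size_t num_of_bids,
                                      std::size_t bid_size,
                                      std::size_t elem_size) {
    assert(elem_size > 0);
    assert(num_of_bids * bid_size <= block_bytes);   // links must fit
    return (block_bytes - num_of_bids * bid_size) / elem_size;
}
```

For example, assuming 8192 bytes per block, one link of 8 bytes and 4-byte elements, the helper yields (8192 - 8) / 4 = 2046 elements per block.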

Every block has an associated user identifier and an associated block identi- 
fier. The user identifier designates the owner of the block and the block identifier 
designates the disk block to which the logical block is bound. The container class 
block<E> is directly connected to functions of the concrete kernel (by use of the 
class ext_memory_manager), i.e. function write of class block<E> uses the write 
function of the concrete kernel to initiate the physical I/Os. 

The concrete kernel is responsible for performing I/Os and managing 
disk space in secondary memory and consists of classes ext_memory_manager, 
ext_disk, ext_freelist and nameserver. The class ext_disk is responsible for file I/O 
and models an external disk device. There are several choices for file I/O: system 
call I/O, standard file I/O, memory mapped I/O and memory I/O (an 
in-memory disk simulation). Each of these methods has different advantages and 
disadvantages. As we model disk locations by a class of their own (B_ID), we do not 
ignore the ordering of disk blocks and are able to manage sequential disk accesses. 
Furthermore, it is possible to switch to the raw disk drive* 
(underlying the file system). In this way we drop the overhead introduced 
by the file system layer and handle the problem of non-contiguous disk blocks 
of files; on the other hand, we also lose the caching effects that the file 
system provides. This feature is not portable across all platforms and is therefore 
not directly implemented in the library.** The class ext_freelist is responsible for 
managing allocated and free disk blocks in external memory. Disk blocks can be 
free or in use by a dedicated user, and the system must keep track of this. This is 
done by a specific data structure which is currently implemented in four different 
ways. 

* This is the very reason why we model disk positions by a class of their own. 

** On Solaris systems, this is just a change of a few lines of code, but it also 
requires super-user rights. 

We can now summarize the software layout of LEDA-SM. The concrete kernel 
consists of class ext_memory_manager, class nameserver and of the two kernel 
implementation classes for disk access and for disk block management. The ab- 
stract kernel consists of classes block, B_ID and U_ID. All data structures and 
algorithms of LEDA-SM are implemented using only the kernel classes and data 
structures or algorithms available in LEDA. In the next subsection we will give 
an overview of the currently available data structures and algorithms. 
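
The idea of several interchangeable realizations of the disk access class, one of them an in-memory simulation, can be sketched as follows. The interface below is an assumption made for illustration: DiskBackend, MemoryDisk, DiskKind and make_disk are hypothetical names and do not reproduce the API of ext_disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical interface for illustration; LEDA-SM's ext_disk differs.
class DiskBackend {
public:
    virtual ~DiskBackend() = default;
    virtual void write_block(std::size_t num, const void* src) = 0;
    virtual void read_block(std::size_t num, void* dst) = 0;
};

// "Memory I/O": an in-memory disk simulation that counts block transfers.
class MemoryDisk : public DiskBackend {
    std::size_t block_bytes_;
    std::vector<char> store_;
public:
    long reads = 0, writes = 0;

    MemoryDisk(std::size_t num_blocks, std::size_t block_bytes)
        : block_bytes_(block_bytes), store_(num_blocks * block_bytes) {}

    void write_block(std::size_t num, const void* src) override {
        assert((num + 1) * block_bytes_ <= store_.size());
        std::memcpy(store_.data() + num * block_bytes_, src, block_bytes_);
        ++writes;
    }
    void read_block(std::size_t num, void* dst) override {
        assert((num + 1) * block_bytes_ <= store_.size());
        std::memcpy(dst, store_.data() + num * block_bytes_, block_bytes_);
        ++reads;
    }
};

// Run-time choice of the realization. Only the memory variant is sketched;
// system-call, standard-file or memory-mapped variants would be added here.
enum class DiskKind { memory };

std::unique_ptr<DiskBackend> make_disk(DiskKind kind, std::size_t num_blocks,
                                       std::size_t block_bytes) {
    switch (kind) {
    case DiskKind::memory:
        return std::unique_ptr<DiskBackend>(new MemoryDisk(num_blocks, block_bytes));
    }
    return nullptr;
}
```

Because callers only see the abstract interface, data structures built on top of it do not change when a different realization is selected.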



2.2 Data Structures and Algorithms 



We survey the data structures and algorithms currently available in LEDA-SM. 
Theoretical I/O-bounds are given in the classical external memory model of [?], 
where M is the size of the main memory, B is the size of a disk block, and N 
is the input size measured in disk blocks. For the sake of simplicity we assume that 
D, the number of disks, is one. 



Stacks and Queues: External stacks and queues are simply the secondary 
memory counterpart to the corresponding internal memory data structures. Op- 
erations push, pop and append are implemented in optimal O(1/B) amortized 
I/Os. 



Priority Queues: Secondary memory priority queues can be used for large-scale 
event simulation, in graph algorithms or for online sorting. Three different imple- 
mentations are available. Buffer trees [?] achieve optimal O((1/B) log_{M/B}(N/B)) 
amortized I/Os for operations insert and del_min. Radix heaps are an extension 
of [?] towards secondary memory. This integer-based priority queue achieves 
O(1/B) I/Os for insert and O((1/B) log_{M/B}(C)) I/Os for del_min, where C is 
the maximum allowed difference between the last deleted minimum key and the 
actual keys in the queue. Array heaps achieve O((1/B) log_{M/B}(N/B)) 
amortized I/Os for insert and O(1/B) amortized I/Os for del_min. 



Arrays: Arrays are a widely used data structure in internal memory. The 
main drawback of internal-memory arrays is the fact that when used in sec- 
ondary memory, it is not possible to control the paging. Our external array 
data structure consists of a consecutive collection of disk blocks and an internal- 
memory data structure of fixed size that implements a multi-segmented cache. 
The caching allows the user to control the internal-memory usage of the external 
array data structure. As the cache is multi-segmented, it is possible to index different 
regions of the external array and use a different segment of the cache for each 
region. Several page-replacement strategies are supported, such as LRU, random 
and fixed. Users can also implement their own page-replacement strategy. 
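
To make the role of the cache concrete, the following sketch models an array whose pages live on a simulated disk and are accessed through a fixed number of cached pages with LRU replacement. It is a simplification under our own assumptions: the class PagedArray is hypothetical, uses a single cache segment rather than a multi-segmented cache, and supports only the LRU strategy.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class PagedArray {
    std::size_t page_sz_, capacity_;
    std::vector<std::vector<int>> disk_;          // simulated disk pages
    std::list<std::size_t> lru_;                  // most recently used page first
    struct Slot {
        std::vector<int> data;
        std::list<std::size_t>::iterator pos;     // position in lru_
        bool dirty;
    };
    std::unordered_map<std::size_t, Slot> cache_; // page number -> cached page

    Slot& fetch(std::size_t p) {
        assert(p < disk_.size());
        auto it = cache_.find(p);
        if (it != cache_.end()) {                 // hit: mark as most recent
            lru_.splice(lru_.begin(), lru_, it->second.pos);
            return it->second;
        }
        if (cache_.size() == capacity_) {         // miss: evict the LRU page
            std::size_t victim = lru_.back();
            Slot& v = cache_.at(victim);
            if (v.dirty) { disk_[victim] = v.data; ++page_writes; }
            cache_.erase(victim);
            lru_.pop_back();
        }
        ++page_reads;
        lru_.push_front(p);
        Slot s{disk_[p], lru_.begin(), false};
        return cache_.emplace(p, std::move(s)).first->second;
    }
public:
    long page_reads = 0, page_writes = 0;

    PagedArray(std::size_t n, std::size_t page_sz, std::size_t cache_pages)
        : page_sz_(page_sz), capacity_(cache_pages) {
        assert(page_sz > 0 && cache_pages > 0);
        disk_.assign((n + page_sz - 1) / page_sz, std::vector<int>(page_sz, 0));
    }

    int get(std::size_t i) { return fetch(i / page_sz_).data[i % page_sz_]; }

    void set(std::size_t i, int v) {
        Slot& s = fetch(i / page_sz_);
        s.data[i % page_sz_] = v;
        s.dirty = true;                           // written back on eviction
    }
};
```

The internal memory used by such an array is bounded by cache_pages pages regardless of n, which is the kind of control that paging by the operating system does not offer.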

Sorting: Sorting is implemented by multiway mergesort. Internal sorting dur- 
ing the run-creation phase is realized by LEDA quicksort, which is a fast and 
generic code-inlined template sorting routine. Sorting N items takes optimal 
O((N/B) log_{M/B}(N/B)) I/Os. 

B-Trees: B-Trees [?] are the classical secondary memory online search trees. We 
use a B+-implementation and support operations insert, delete, delete_min and 
search in optimal O(log_B(N)) I/Os. 

Suffix arrays and strings: Suffix arrays [?] are a full-text indexing data 
structure for large text strings. We provide several different construction al- 
gorithms [?] for suffix arrays as well as exact searching, 1- and 2-mismatch 
searching, and 1- and 2-edit searching. 

Matrix operations: We provide matrices with entry type double. The opera- 
tions +, -, and * for dense matrices are realized with optimal I/O-bounds. 



Graphs and graph algorithms: We provide a data type external graph and 
simple graph algorithms like depth-first search, topological sorting and Dijkstra’s 
shortest path computation. External graphs are static, i.e. graph updates are 
expensive. 



2.3 Further Features 

Secondary-memory data structures and algorithms also use internal memory. 
LEDA-SM allows the user to control the amount of memory that each data 
structure and/or algorithm uses. The amount of memory is either specified at 
the time of construction of the data structure or it is an additional parameter 
of a function call. If we look at our stack example of Section 3 we see that the 
constructor of data type ext_stack has a parameter a, the number of blocks of 
size blk_size that are held in internal memory. We therefore immediately know 
that the internal memory space occupancy is a · blk_size + O(1) bytes. 

The second feature of LEDA-SM is accounting of I/Os. The kernel supports 
counting read and write operations. Some of the reads may be logical and can 
be served by the buffer of the operating system. It is also possible to count 
consecutive I/Os (also introduced by [?] as bulk I/Os). This allows the user 
to experimentally classify the I/O-structure of algorithms and in this way to 
compare algorithms with the same asymptotic I/O-complexity (see also [?]). 

3 Examples 

We discuss the implementation of secondary memory stacks and secondary mem- 
ory graph search. The first example shows how the kernel of LEDA-SM is used 
to implement data structures and algorithms. The second example shows that 
secondary memory algorithms can be coded in a natural way in LEDA-SM and 
LEDA. It also shows the interplay between LEDA and LEDA-SM. 



3.1 External Memory Stacks 

We discuss the implementation of external memory stacks. It is simple, both 
conceptually and in terms of programming. The point of this section is to show 
how easy it is to translate an algorithmic idea into a program. 

An external memory stack S for elements of type E (ext_stack<E>) is realized 
by an array (a LEDA data structure) A of 2a blocks of type block<E> and a 
linear list of disk blocks. Each block in A can hold blk_sz elements, i.e., A can 
hold up to 2a · blk_sz elements. We may view A as a one-dimensional array of 
elements of type E. The slots 0 to top of this array are used to store elements 
of S, with top designating the top of the stack. The older elements of S, i.e., 
the ones that do not fit into A, reside on disk. We use bid to store the block 
identifier of the elements moved to disk most recently. Each disk block has one 
link; it is used to point to the block below. The number of elements stored on 
disk is always a multiple of a · blk_sz. 

<ext_stack>= 

template <class E> 
class ext_stack 
{ 
    array< block<E> > A;                // the 2a in-core blocks (LEDA array) 
    int top_cnt, a_sz, s_sz, blk_sz; 
    B_ID bid;                           // blocks moved to disk most recently 

public: 
    ext_stack(int a = 1); 
    void push(E x); 
    E pop(); 
    E top(); 
    int size() { return s_sz; }; 
    void clear(); 
    ~ext_stack(); 
}; 



We next discuss the implementation of the operations push and pop. We denote 
by a_sz = 2a the size of array A. A push operation S.push(x) writes x into the 
location top + 1 of A except if A is full. If A is full (top_cnt == a_sz * blk_sz - 1), 
the first half of A is moved to disk, the second half of A is moved to the first 
half, and x is written to the first slot in the second half. 

<ext_stack>+= 

template<class E> 
void ext_stack<E>::push(E x) 
{ 
    int i; 
    if (top_cnt == a_sz*blk_sz - 1)     // A is full 
    { 
        A[0].bid(0) = bid;              // link to the blocks already on disk 
        bid = ext_mem_mgr.new_blocks(myid, a_sz/2); 
        block<E>::write_array(A, bid, a_sz/2); 
        for (i = 0; i < a_sz/2; i++) 
            A[i] = A[i+a_sz/2];         // move second half to first half 
        top_cnt = (a_sz/2)*blk_sz - 1; 
    } 
    top_cnt++; 
    A[top_cnt/blk_sz][top_cnt%blk_sz] = x; 
    s_sz++; 
} 


The interesting case of push is the one where we have to write the first half of 
A to disk. In this step we have to do the following: we must reserve a = a_sz/2 
disk blocks on disk and we must add the first a blocks of array A to the linked 
list of blocks on disk. The blocks are linked by using the array (of length one) 
of type B_ID of class block (see Section 2.1), and the blocks most recently written 
are identified by block identifier bid. The command A[0].bid(0) = bid creates a 
backwards linked list of disk blocks which we use during pop operations later. 
We then use the kernel to allocate a consecutive free disk blocks by the command 
ext_mem_mgr.new_blocks. The return value is the first allocated block identifier. 
The first half of array A is written to disk by calling write_array of class block, 
which tells the kernel to initiate the necessary physical I/Os. In the next step 
we copy the last a blocks of A to the first a blocks and reset top_cnt. Now the 
normal push can continue by copying element x to its correct location inside A. 

A pop operation S.pop() is also trivial to implement. We read out the element 
in slot top except if A is empty. If A is empty and there are elements residing 
on disk, we move a · blk_sz elements from disk into the left half of A. 

<ext_stack>+= 

template<class E> 
E ext_stack<E>::pop() 
{ 
    if (top_cnt == -1 && s_sz > 0)      // A is empty, elements remain on disk 
    { 
        B_ID oldbid = bid; 
        block<E>::read_array(A, oldbid, a_sz/2); 
        bid = A[0].bid(0);              // follow the backward link 
        top_cnt = (a_sz/2)*blk_sz - 1; 
        ext_mem_mgr.free_blocks(oldbid, myid, a_sz/2); 
    } 
    s_sz--; 
    top_cnt--; 
    return A[(top_cnt+1)/blk_sz][(top_cnt+1)%blk_sz]; 
} 

If array A is empty (top_cnt == -1) we load a blocks from disk into the first 
a array positions of A by calling read_array. These disk blocks are identified 
by bid. We then restore the invariant that block identifier bid stores the block 
identifier of the blocks most recently written to disk. As the disk blocks are linked 
backwards, we can retrieve this block identifier from the first entry of the array 
of block identifiers of the first loaded disk block (A[0].bid(0)). Array A now stores 
a · blk_sz data items of type E, and variable top_cnt is reset accordingly. The just 
loaded disk blocks are now stored internally, so there is no reason to keep 
them on disk. These disk blocks are freed by calling the kernel routine 
ext_mem_mgr.free_blocks. The return value of operation pop is the top-most element 
of A. 

Operations push and pop move a blocks at a time. As the read and write 
requests for the a blocks always target consecutive disk locations, we can choose 
a in such a way that it maximizes the disk-to-host throughput rate. After the move- 
ment A is half full and hence there are no I/Os for the next a · blk_sz stack 
operations. Thus the amortized number of I/Os per operation is 1/blk_sz, which 
is optimal. Stacks with fewer than 2a · blk_sz elements reside in-core. 
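
The amortized bound can be observed on a small simulation. The class below is a hypothetical stand-in written only for this purpose: it uses standard containers instead of LEDA and LEDA-SM, models the disk as a vector of chunks, and counts a block transfers for every move of half the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class SimStack {
    std::size_t a_, blk_sz_;
    std::vector<int> buf_;                 // in-core part, at most 2*a*blk_sz elements
    std::vector<std::vector<int>> disk_;   // each entry models a consecutive disk blocks
public:
    long block_ios = 0;                    // number of block transfers so far

    SimStack(std::size_t a, std::size_t blk_sz) : a_(a), blk_sz_(blk_sz) {}

    void push(int x) {
        const std::size_t half = a_ * blk_sz_;
        if (buf_.size() == 2 * half) {     // buffer full: move the older half to disk
            auto mid = buf_.begin() + static_cast<std::ptrdiff_t>(half);
            disk_.emplace_back(buf_.begin(), mid);
            buf_.erase(buf_.begin(), mid);
            block_ios += static_cast<long>(a_);
        }
        buf_.push_back(x);
    }

    int pop() {
        if (buf_.empty() && !disk_.empty()) {   // reload the most recent chunk
            buf_ = std::move(disk_.back());
            disk_.pop_back();
            block_ios += static_cast<long>(a_);
        }
        assert(!buf_.empty());
        int x = buf_.back();
        buf_.pop_back();
        return x;
    }

    std::size_t size() const { return buf_.size() + disk_.size() * a_ * blk_sz_; }
};
```

With a = 2 and blk_sz = 4, pushing 100 elements and popping them again causes 44 block transfers for 200 operations, which stays below the bound of 200 / blk_sz = 50.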

3.2 Graph Search 

We give a simple example that shows how to use both LEDA data structures 
and LEDA-SM data structures, and how they interact. Our example computes 
the number of nodes of a graph G reachable from a source node v by using 
graph search. We assume that a bit vector for all nodes of graph G can be stored 
in internal memory. Internal-memory graph data types heavily rely on large 
linked lists or large arrays to implement the adjacency list representation of 
large graphs. As most of the graph algorithms like depth-first search, Dijkstra’s 
shortest path etc. access the graph in an unstructured way, they are slowed down 
tremendously by the “uncontrolled” paging activity of the operating system. Our 
external graph data type uses external arrays to control the paging. 

<graph search>= 

template<class T> 
long graph_search(T& G, ext_node<T> v, int_set Visited) 
{ 
    ext_stack< ext_node<T> > S; 
    ext_edge<T> e; 
    long i = 1;                         // number of reached nodes, including v 

    Visited.insert(v); 
    S.push(v); 
    while (S.size()) 
    { 
        v = S.pop(); 
        forall_out_edges(e, v, G) 
            if (!Visited.member(G.target(e))) 
            { 
                Visited.insert(G.target(e)); 
                S.push(G.target(e)); 
                i++; 
            } 
    } 
    return i; 
} 

ext_graph<int, char, int, char> G; 
int_set visited(1, G.number_of_nodes()); 
ext_node<ext_graph<int, char, int, char> > v; 
int reachable = graph_search(G, v, visited); 

The LEDA-SM graph data type ext_graph is parameterized so that algorithm- 
dependent information can be associated with nodes and edges. For efficiency, it 
is important that node and edge labels are stored directly in the nodes and edges 
and are not accessed through pointers or indices (as this would imply an I/O- 
operation for each label). Different algorithms need different labels and hence 
it is crucial that the space allocated for labels can be reused. We have chosen 
the following design: each node (and similarly each edge) stores information of an 
arbitrary type X and an array of characters; X and the size of the array are 
fixed when the graph is defined. We use the array of characters to store arbitrary 
information of fixed length (through appropriate casting) and use the field of 
type X for “particularly important” information; we could also do without it. In 
our example, each node and edge has an int and a single char associated with 
it. Edges and nodes are parameterized with the graph data type to which they 
belong. 

The example shows the interaction between LEDA and LEDA-SM. Data 
structures from both libraries are used. The bit vector Visited is implemented by 
LEDA data type int_set. Reached nodes are directly inserted into the int_set. The 
necessary type conversion from type ext_node to type int is performed by LEDA- 
SM, i.e. type ext_node can be converted to type int, because the nodes of an 
external graph are consecutively numbered. We have implemented graph_search 
in a recursion-free way by using a stack (type ext_stack). We have chosen an 
external stack with minimum internal space requirement. 

The example shows that LEDA-SM and LEDA support a natural program- 
ming style and allow the user to freely combine the required data structures and 
algorithms. The I/O operations of LEDA-SM are completely hidden in the data 
structures and are not visible in the algorithms layer. 

We conclude this section by explaining how one can eliminate the use of 
an internal bit-field with size equal to the number of nodes of G. If the main 
memory cannot store a bit-field, we use external depth-first search that proceeds 
in rounds (see [?]). In each round, we store visited nodes in an internal dictionary 
(LEDA data type dictionary) of a fixed size. If the dictionary is full (no space 
left in main memory), we compact the edges of the external graph by deleting 
(marking) those edges that point to nodes that were already visited. We then 
delete the dictionary and start the next round. 

4 Performance Results 



We report on the performance of LEDA-SM. In our tests, we compare sec- 
ondary memory data structures and algorithms of LEDA-SM to their internal 
memory counterparts (of LEDA). For more detailed results on the performance 
of LEDA-SM see [?]. All tests were performed on a SUN UltraSparc1/143 using 
64 Mbytes of main memory and a single SCSI disk connected to a fast-wide SCSI 
controller. All tests used a disk block size of 8 kbytes. We note that this is not 
the optimal disk block size for the disk according to data throughput versus 
service time. However, this disk block size allows us to compare the secondary 
memory algorithms in a fair way to internal memory algorithms in swap space 
because the page size of the machine is also 8 kbytes. All external memory al- 
gorithms and data structures get faster if we use the optimal (according to data 
throughput) disk block size of 64 or 128 kbytes. 

Fig. 1. Comparison of Multiway-Mergesort of LEDA-SM and quicksort of 
LEDA. 

Figure 1 compares external multiway mergesort and LEDA quicksort. Exter- 
nal multiway mergesort uses LEDA quicksort to partition the data into sorted 
runs before merging proceeds. The sorted runs are merged using a priority queue. 
The external sorting routine uses approx. 16 Mbytes of main memory. As soon 
as the input to be sorted reaches the size of the main memory, the external 
sorting routine is faster than the internal sorting routine. The sharp bend in 
the curve of external multiway mergesort occurs because we change the priority 
queue implementation inside the merging routine. 
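
The two phases described above, run creation followed by a merge driven by a priority queue, can be sketched with standard containers. This is an in-memory illustration under our own assumptions, not the LEDA-SM sorting routine: std::sort stands in for LEDA quicksort, std::priority_queue for the priority queue of the merging routine, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge sorted runs using a priority queue keyed on the head of each run.
std::vector<int> merge_runs(const std::vector<std::vector<int>>& runs) {
    using Head = std::pair<int, std::size_t>;            // (value, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<Head>> pq;
    std::vector<std::size_t> pos(runs.size(), 0);        // next unread slot per run

    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) pq.push(Head(runs[r][0], r));

    std::vector<int> out;
    while (!pq.empty()) {
        Head h = pq.top();
        pq.pop();
        out.push_back(h.first);
        std::size_t r = h.second;
        if (++pos[r] < runs[r].size()) pq.push(Head(runs[r][pos[r]], r));
    }
    return out;
}

// Phase 1: cut the input into runs of run_len items and sort each run.
// Phase 2: merge all runs in a single pass.
std::vector<int> sort_by_runs(const std::vector<int>& data, std::size_t run_len) {
    assert(run_len > 0);
    std::vector<std::vector<int>> runs;
    for (std::size_t i = 0; i < data.size(); i += run_len) {
        std::size_t end = std::min(i + run_len, data.size());
        runs.emplace_back(data.begin() + static_cast<std::ptrdiff_t>(i),
                          data.begin() + static_cast<std::ptrdiff_t>(end));
        std::sort(runs.back().begin(), runs.back().end());
    }
    return merge_runs(runs);
}
```

In an external setting run_len would be chosen so that one run fits into the available main memory, and only the current head block of each run would be held in-core during the merge.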

Figure 2 compares internal and external matrix multiplication of dense ma- 
trices of type double. The external matrix multiplication code uses tiling and 
matrix reordering (see also [?]). External matrix multiplication is faster than 
internal matrix multiplication even if the total input size is smaller than the 
main memory. This effect is due to the better cache behavior of the external 
matrix code (although it uses the same internal matrix multiplication code as 
LEDA). 

Figure 3 compares the performance of operation insert of B-Trees and 2-4 
trees. Both data structures are classical online search trees. The internal data 
structure is slowed down dramatically if it is running in swap space. The 2-4 tree 
does not exhibit locality of reference and its use of pointers leads to many page 
faults. 

Fig. 2. Comparison of LEDA-SM matrix multiplication against LEDA matrix 
multiplication. 



Figure 4 compares the performance of operation insert of external radix 
heaps to Fibonacci heaps and radix heaps [?] of LEDA. We see that the 
internal priority queues fail if the main memory size is exceeded. 

5 Conclusions 

We proposed LEDA-SM, an extension of LEDA towards secondary memory. 
The library follows LEDA’s main features: portability, ease of use and efficiency. 
The performance results of LEDA-SM are promising. Although we use a high- 
level implementation for the library, we are orders of magnitude faster than 
the corresponding internal data structures. The speedup increases for larger 
disk block sizes. The performance of many secondary memory data structures is 
determined by the speed of their internally used data structures; recall that the 
goal of algorithm design for external memory is to overcome the I/O-bottleneck. 
If successful, external memory algorithms are compute-bound. LEDA-SM profits 
from the efficient in-core data structures and algorithms of LEDA.* LEDA-SM 
is still growing. Future directions of research cover geometric computation [?] 
as well as graph applications. Important practical experiments should be done 
for parallel disks (RAID arrays) as well as low-level disk device access. We plan 
to do both in the near future. 

* LEDA’s quicksort routine was recently improved (as a consequence of our tests of 
LEDA-SM mergesort). For user-defined data types this led to a speed-up of almost 
five. LEDA-SM’s multiway mergesort profited immediately; its running time was 
reduced by a factor of two. 

Fig. 3. Performance of insert operation for LEDA-SM B-Trees and 2-4-trees of 
LEDA. 

Fig. 4. Comparison of LEDA-SM radix heaps against LEDA’s radix-heaps and 
Fibonacci-heaps. 

Abstract. A transformation, referred to as depletion, is defined for 
comparison-based data structures that implement priority queue opera- 
tions. The depletion transform yields a representation of the data struc- 
ture as a forest of heap-ordered trees. Under certain circumstances this 
transform can result in a useful alternative to the original data structure. 
As an application, we introduce a new variation of skew-heaps that ef- 
ficiently implements decrease-key operations. Additionally, we construct 
a new version of the pairing heap data structure that experimentally 
exhibits improved efficiency. 



1 Introduction 

The focus of this paper centers on comparison-based priority queue data struc- 
tures. A priority queue is considered comparison-based provided that the ele- 
ments that it stores can belong to an arbitrarily chosen linearly ordered universe. 
In particular, the only operations that can be performed on these elements (by 
the data structure) are simple comparisons. A mergeable priority queue is one 
that efficiently supports the merge operation. Among such data structures can 
be found those that are represented as a forest consisting of one or more heap- 
ordered trees, and moreover, have the property that comparisons take place only 
among tree roots in the forest as operations are performed. Examples include 
binomial queues [?], Fibonacci heaps and the various forms of pairing heaps 
[?]. We shall refer to such data structures as forest-based priority queues. Al- 
though the context of our discussion concerns mergeable priority queues, this is 
not an essential aspect. 

Let Q consist of a collection of individual priority queues, Q1, Q2, .... Assume 
that the priority queue operations being supported consist of merge(Qi, Qj), 
which replaces Qi with the merged result (removing Qj in the process), 
deletemin(Qi), and the operation make-queue(Qk, x), which creates a priority 
queue Qk containing the single value x. Insertion is defined in terms of make- 
queue and merge. The depletion transform of Q, defined next, yields a forest- 
based representation of Q. As priority queue operations take place, the deple- 
tion transformation maintains a collection S of corresponding shadow structures 
S1, S2, ..., where Si is referred to as the shadow of Qi and consists of a forest 
of heap-ordered trees that store the values contained in Qi. Associated with 
each Si is a queue Ci consisting of comparison operations, with each compari- 
son in Ci having an associated time-stamp. The comparisons in Ci are arranged 
in order by their time-stamps, with the back of Ci containing the comparison 
having the most recent time-stamp. The role of the comparison queues will be 
described shortly. As operations take place with Q, the depletion transform ma- 
nipulates S and the associated comparison queues Ci as follows. When executing 
merge(Qi, Qj), the forests of the respective shadows Si and Sj are combined and 
replace Si. Moreover, the comparison queues Ci and Cj are merged (respecting 
time-stamp order) and replace Ci. When executing make-queue(Qk, x), a forest 
Sk consisting of one node that stores the value x is created, and Ck is set to 
be the empty queue. When deletemin(Qi) is performed, the corresponding tree 
root in Si is deleted, and the subtrees of this root become new trees in the forest 
of Si. As comparisons are performed involving values in Qi, each such compari- 
son is inserted into Ci with its time-stamp set to the current time. We refer to 
mergings, deletemins, and individual comparisons as events. 

We now describe the manner in which linkings are established in the shadow 
S. After each event takes place, a given queue Ci is processed as follows. Assume 
that there exists in Ci a comparison that involves two tree roots in Si, and let c 
be the one having the oldest time-stamp (closest to the front of Ci). Then the 
corresponding trees in Si are linked to reflect the comparison c (the root which 
“loses” the comparison becomes the leftmost child of the root which “wins” 
the comparison) and c is removed from Ci. The process is repeated until Ci no 
longer contains any applicable comparisons. Informally, the depletion transform 
defers a comparison until the two items being compared each become tree roots, 
at which point the comparison gets executed (provided that there are no earlier 
comparisons free to be executed). As will become clear, our definition of depletion 
is strictly conceptual, and will not in itself constitute a performance issue. 

The following figures illustrate depletion, highlighting the deletemin opera- 
tion. Figure 1 shows a priority queue Q, its shadow S and the associated comparison 
queue C; figure 2 shows the result of executing deletemin. Referring to figure 
1, upon executing the deletemin operation, treating Q as a skew heap, we find 
that the comparisons (7 : 4) and (7 : 10) get executed, and these comparisons 
get appended to the back of C. Now with respect to S, the removal of its root 
permits execution of the linkings (5 : 4) and (7 : 8) from C, as well as the link- 
ing corresponding to the newly appended comparison (7 : 4), since the nodes 
that hold these values have become tree roots (and remain so at the respective 
moments that these linkings take place). The two remaining comparisons (5 : 9) 
and (7 : 10) in C are still deferred. The structures are thus modified as shown 
in figure 2. 
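
The processing rule for a comparison queue, and the deletemin example just described, can be replayed with a small sketch. The structure below is our own illustration and is hypothetical throughout: Shadow, add, remove_root and deplete are invented names, nodes are addressed by integer ids, and the code only models the conceptual definition.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <list>
#include <utility>
#include <vector>

struct Shadow {
    struct Node { int value; int parent; bool removed; std::deque<int> children; };
    std::vector<Node> nodes;
    std::list<std::pair<int, int>> pending;   // comparison queue C, oldest first

    // Add a node; parent == -1 makes it a tree root.
    int add(int value, int parent = -1) {
        int id = static_cast<int>(nodes.size());
        nodes.push_back(Node{value, parent, false, {}});
        if (parent >= 0) nodes[static_cast<std::size_t>(parent)].children.push_back(id);
        return id;
    }

    const Node& at(int v) const { return nodes[static_cast<std::size_t>(v)]; }
    Node& at(int v) { return nodes[static_cast<std::size_t>(v)]; }

    bool is_root(int v) const { return !at(v).removed && at(v).parent == -1; }

    // deletemin on the shadow: remove root r; its subtrees become new trees.
    void remove_root(int r) {
        assert(is_root(r));
        for (int c : at(r).children) at(c).parent = -1;
        at(r).children.clear();
        at(r).removed = true;
    }

    // Execute deferred comparisons: always pick the oldest comparison whose
    // two items are both tree roots, link, and repeat until none applies.
    void deplete() {
        auto it = pending.begin();
        while (it != pending.end()) {
            if (is_root(it->first) && is_root(it->second)) {
                int win = it->first, lose = it->second;
                if (at(lose).value < at(win).value) std::swap(win, lose);
                at(lose).parent = win;
                at(win).children.push_front(lose);   // loser becomes leftmost child
                pending.erase(it);
                it = pending.begin();                // rescan from the oldest entry
            } else {
                ++it;
            }
        }
    }
};
```

As the figure itself is not reproduced here, the replay needs an assumed forest: a root holding 2 with children 4, 5, 7 and 8, with 9 below 4 and 10 below 8. Starting from C = (5:4)(7:8)(5:9), removing the root and appending (7:4) and (7:10), a call to deplete executes (5:4), (7:8) and (7:4) and leaves (5:9) and (7:10) deferred, in agreement with the description above.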

The depletion transform typically loses information; some of the comparisons 
in the Ci’s never get executed. When this happens the shadow S is not a viable 
representation of Q as an independent data structure. However, under appro- 
priate circumstances no such loss occurs, and moreover, the manipulations of 
S, as induced by the operations of Q, can be inferred without maintaining any 
structure beyond that represented by the Si’s. 
When this is the case, we refer to Q as being faithfully depletable, and its 
shadow structure S can enjoy an independent existence. The depletion trans- 
formation, in such instances, results in a forest-based data structure enjoying 
the property that its amortized costs (measured in comparisons) do not exceed 
those of the original data structure Q. Moreover, decrease-key operations are 
easily implementable in the shadow data structure. (Recall that the decrease- 
key operation replaces the value of the referenced item with a smaller value.) 
Exploiting the fact that the resulting data structure is forest-based, a decrease- 
key operation is implemented by positioning the subtree of the node affected by 
the operation as a new tree in its forest (possibly just momentarily), assuming 
that it is not already a tree root; this is the same technique used for Fibonacci 
heaps [?] and pairing heaps [?]. 

Fig. 1. A priority queue and corresponding shadow structure, with comparison 
queue C = (5:4)(7:8)(5:9) 

Fig. 2. Corresponding structures following deletemin execution 

In the following section we consider two examples of depletion. The first 
provides a simple illustration of the depletion transform, and the second provides 
an application of depletion to the skew-heap data structure [?], which is shown 
to be faithfully depletable. In a subsequent section we construct a new variation 
of pairing heaps that enjoys improved efficiency, and note that our construction 
can be motivated in terms of considerations involving depletion. Finally, we report 
experimental findings involving this new data structure. 



2 Examples 

Our first example provides a simple illustration of the depletion transform. An 
individual priority queue Q is represented as a binary tournament tree. (We do 
not require that this tree be balanced.) Data items are stored in the leaves of the 
tree and the winners of comparisons are promoted upwards (copied rather than 
moved) through the internal nodes of the tree, so that the minimum data item 
is found in the root of the tree. For the purpose of this example, we implement 
the merge operation as follows. Let $Q_1$ and $Q_2$ be the tournament trees being 
merged. First, we establish a new root node r and link both $Q_1$ and $Q_2$ as the 
two subtrees of r. Second, we store in r the smaller of the two values stored in 
the respective roots of $Q_1$ and $Q_2$ (promotion). 

Finally, the deletemin operation is implemented by removing the value stored 
in the root of T (say) and in all nodes along the path leading to the leaf of T 
containing the value being deleted. (The vacated leaf gets removed from the tree 
and its parent gets replaced by the sibling of this leaf.) Then proceeding in a 
bottom-up manner, the vacated nodes along this path are refilled, promoting 
the smaller of the two values stored in the children of the node being refilled at 
each step. 
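
The tournament-tree operations just described can be sketched in Python as follows. This is only an illustrative rendering under stated assumptions: the names `TNode`, `merge`, and `deletemin` are hypothetical, the tree is not balanced, and recursion is used for brevity.

```python
class TNode:
    """Tournament-tree node. A leaf stores a data item; an internal node
    stores a copy of the winner (smaller value) of its two children."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None


def merge(q1, q2):
    """Link q1 and q2 under a new root holding the smaller root value."""
    if q1 is None:
        return q2
    if q2 is None:
        return q1
    return TNode(min(q1.value, q2.value), q1, q2)   # promotion


def deletemin(t):
    """Return (minimum, remaining tree)."""
    minimum = t.value

    def remove(node):
        if node.is_leaf():
            return None                     # the vacated leaf is removed
        # Follow the child whose value was promoted into this node.
        if node.left.value == node.value:
            new_left, new_right = remove(node.left), node.right
        else:
            new_left, new_right = node.left, remove(node.right)
        if new_left is None:
            return new_right                # parent replaced by the sibling
        if new_right is None:
            return new_left
        node.left, node.right = new_left, new_right
        node.value = min(new_left.value, new_right.value)   # refill bottom-up
        return node

    return minimum, remove(t)
```

Inserting an item x corresponds to `merge(t, TNode(x))`; repeated calls to `deletemin` then return the items in nondecreasing order.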

Now we consider the depletion transform applied to the above data structure. 
In this instance the resulting shadow structure is similar to a pairing heap, and 
there are no deferred comparisons, so that the comparison queues associated with 
the shadow structures are empty. (Proofs pertaining to this discussion are left to 
the reader.) An individual priority queue in the shadow structure is represented 
as a single heap-ordered tree, and the priority queue operations are performed 
as follows. Given two heap-ordered trees, a pairing operation on these trees 
combines them by comparing the values in their respective roots, and making 
the root which loses the comparison the leftmost child of the other root. The 
merge operation is performed by pairing the respective individual trees. 

The deletemin operation is performed by removing the root of the tree and 
then pairing the remaining subtrees in order from right-to-left; each subtree is 
paired with the result of the linkings of the subtrees to its right. We refer to 
this right-to-left sequence of node pairings as a right-to-left incremental pairing 
pass. (In contrast with the pairing heap data structure, the deletemin 
operation here omits the first (left-to-right) pairing pass encompassed by the 
pairing heap deletemin operation, performing only the second pass. Below, we will need 
to recall that the first, omitted pass, proceeds by pairing the first and second 
subtrees, then the third and fourth subtrees, etc.) 
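
As a rough illustration, the shadow structure of this first example might be coded as below. The sketch assumes a Python list for the children of a node (leftmost child first), which is not how a tuned implementation would store them; the names `Node`, `pair`, and `incremental_pass` are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.children = []      # children[0] is the leftmost child


def pair(a, b):
    """Pairing operation: compare the two roots and make the loser the
    leftmost child of the winner. Either argument may be None."""
    if a is None:
        return b
    if b is None:
        return a
    if b.key < a.key:
        a, b = b, a
    a.children.insert(0, b)
    return a


def incremental_pass(trees):
    """Right-to-left incremental pairing pass: each tree is paired with
    the result of the linkings of the trees to its right."""
    result = None
    for t in reversed(trees):
        result = pair(t, result)
    return result


def merge(a, b):
    return pair(a, b)


def deletemin(root):
    """Remove the root and combine its subtrees with a single pass."""
    return root.key, incremental_pass(root.children)
```

Note that `deletemin` here performs only the right-to-left pass; the left-to-right pairing of adjacent subtrees used by the ordinary pairing heap is deliberately absent.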

2.1 Application to Skew-Heaps 

We turn next to our second example, skew-heaps. First, we describe the 
particular variant of the skew-heap data structure whose depletion will prove to 
be particularly attractive. The skew-heap consists of a binary heap-ordered tree. 
The deletemin operation is performed by merging the two subtrees of the root 
(the root having been first removed). 

The merge operation proceeds by merging the right spines of the two respective 
trees. Suppose that the right spine of one tree consists of $x_1, x_2, \ldots, x_r$, 
that the right spine of the other tree consists of $y_1, y_2, \ldots, y_s$ (in order, 
starting with the respective tree roots), and that $v_1, v_2, \ldots, v_t$ ($t = r + s$) 
gives the result of merging the two spines. Suppose that (say) the sequence of 
$x_i$'s is exhausted prior to the sequence of $y_i$'s in this merged result. Let 
$v_h = x_r$, $h < t$, be the point at which the $x_i$'s are exhausted. Then in the 
resulting merged tree the $v_i$'s comprise a path starting at the root; the links 
joining $v_1, v_2, \ldots, v_{h+1}$ are all left links, and the remaining links (those 
joining $v_{h+1}, \ldots, v_t$) all consist of right links. In other words, as merging 
takes place in top-down order, subtrees are swapped, but not below the level of 
the first node beyond the exhausted sublist. 
An alternative, recursive formulation of the merge operation is given as follows. 
Let u and v be the two trees being merged and let merge(u,v) denote the result 
of the merging. If v is empty, then merge(u,v) is given by u. Otherwise, assume 
that the root r of u wins its comparison with the root of v and let $u_L$ and $u_R$ be 
the left and right subtrees of u, respectively. Then merge(u,v) has r as its root, 
$u_L$ as its right subtree, and merge($u_R$,v) as its left subtree. 
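
The recursive formulation translates almost directly into code. The following Python sketch is one possible rendering; the class name `SkewNode` and the function names are hypothetical, and recursion depth is bounded only by the spine lengths.

```python
class SkewNode:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def merge(u, v):
    """Top-down skew-heap merge following the recursive formulation."""
    if u is None:
        return v
    if v is None:
        return u            # nothing left to merge: no swapping below here
    if v.key < u.key:
        u, v = v, u         # arrange that the root of u wins the comparison
    # The root of u stays on top; its old left subtree becomes the right
    # subtree, and merge(u_R, v) becomes the left subtree.
    u.left, u.right = merge(u.right, v), u.left
    return u


def insert(heap, key):
    return merge(heap, SkewNode(key))


def deletemin(heap):
    """Remove the root, then merge its two subtrees."""
    return heap.key, merge(heap.left, heap.right)
```

The right-hand side of the tuple assignment is evaluated before either field is overwritten, so `u.left` on the right still denotes the old left subtree.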

Next, we describe the shadow data structure S given by the depletion 
transform of the skew-heap. This data structure bears striking similarity to the 
pairing heap; only the sequence of pairings performed in the course of a 
deletemin operation differs. An individual priority queue is represented as a single 
heap-ordered tree. Considering the deletemin operation, let $y_1, y_2, \ldots, y_m$ be 
the children of the tree root being deleted, in left-to-right order. The deletemin 
operation proceeds to combine these nodes as follows. First, a right-to-left 
incremental pairing pass is performed among the nodes in odd numbered positions 
(the nodes $y_1, y_3, y_5, \ldots$). Let $S_{odd}$ denote the resulting tree. Second, a 
right-to-left incremental pairing pass is performed among the nodes in even 
numbered positions (the nodes $y_2, y_4, y_6, \ldots$). Let $S_{even}$ denote the resulting 
tree. Finally, $S_{odd}$ and $S_{even}$ are paired. (If $S_{even}$ is empty, then the final 
result is given by $S_{odd}$.) We refer to this shadow data structure as a 
skew-pairing heap. 
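
A minimal Python sketch of the skew-pairing deletemin is given below, assuming a list-based child representation (leftmost child first). The identifiers are hypothetical and the code is meant only to make the order of pairings concrete.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.children = []      # children[0] is the leftmost child


def pair(a, b):
    """The loser of the root comparison becomes the winner's leftmost child."""
    if a is None:
        return b
    if b is None:
        return a
    if b.key < a.key:
        a, b = b, a
    a.children.insert(0, b)
    return a


def incremental_pass(trees):
    """Right-to-left incremental pairing pass over a list of trees."""
    result = None
    for t in reversed(trees):
        result = pair(t, result)
    return result


def skew_pairing_deletemin(root):
    kids = root.children
    s_odd = incremental_pass(kids[0::2])    # y1, y3, y5, ...
    s_even = incremental_pass(kids[1::2])   # y2, y4, y6, ...
    # pair() returns s_odd unchanged when s_even is empty (None).
    return root.key, pair(s_odd, s_even)
```

Positions are counted from 1 in the text, so the odd-numbered children are the slice starting at index 0.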

Theorem 1. The depletion transform, when applied to the skew-heap data struc- 
ture, induces the skew-pairing heap data structure. 

(Proof omitted due to space limitation.) 

There are two arguable advantages of the skew-pairing heap relative to the 
skew-heap. First, we avoid the necessity of swapping subtrees as the operations 
take place. Second, and more fundamentally, the skew-pairing heap provides for 
efficient implementation of decrease-key operations, as described next. 

Because the skew-pairing heap can be viewed as a new variant of pairing 
heap, decrease-key operations can be implemented, as is the case with the usual 
pairing heap, by removing the subtree of the affected node (assuming that it is 
not the root) and then linking it with the root. 
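
The decrease-key step can be sketched as follows. The parent pointer and the list of children are assumptions made for brevity (removing a child from a Python list costs time proportional to the degree, whereas a sibling-linked representation would detach it in constant time); all names are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.parent = None
        self.children = []      # children[0] is the leftmost child


def link(a, b):
    """Link two non-empty trees; the loser becomes the leftmost child."""
    if b.key < a.key:
        a, b = b, a
    b.parent = a
    a.children.insert(0, b)
    return a


def decrease_key(root, node, new_key):
    """Lower node's key, cut its subtree loose, and link it with the root.
    Returns the (possibly new) root of the heap."""
    if new_key > node.key:
        raise ValueError("new key must not exceed the current key")
    node.key = new_key
    if node is root:
        return root
    node.parent.children.remove(node)   # detach the affected subtree
    node.parent = None
    return link(root, node)
```

Only one comparison is spent per decrease-key in this sketch; the restructuring cost is deferred to later deletemin operations.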

It is readily shown that the O(log n) amortized bounds derived for pairing 
heaps extend to this data structure. Indeed, the same potential function used in 
the original analysis of pairing heaps applies here. Moreover, in practice we 
can expect that the decrease-key operation would be considerably more efficient 
than suggested by this bound, although constant amortized cost is precluded as 
a consequence of a lower bound established for a large family of data structures 
that generalize the pairing heap. (Roughly speaking, this bound shows that 
decrease-key operations have an intrinsic cost of $\Omega(\log \log n)$, n = heap size, 
for any self-adjusting variation of the pairing heap that restricts comparisons to 
take place among root nodes of trees in the forest.) 

3 Improved Pairing Heaps 

The perspective gained by consideration of the depletion transform motivates 
certain practical improvements for pairing heaps, which we proceed to describe. 
Returning to the example of the tournament tree, discussed at the beginning 
of the preceding section, two observations come to mind. First, an alternative 
implementation of insertion for this data structure can be entertained, and pro- 
ceeds as follows. We choose a particular leaf of the tree and expand it, replacing 
it with an internal node having two leaves as children. In one of the new leaves 
we store the new value being inserted, and in the other we store the value found 
in the original leaf. We then proceed in a bottom-up manner, restoring the con- 
ditions that define a tournament tree. The advantage of this implementation 
relative to that based on merging (as described at the beginning of the previous 
section) is that the length of just one path from the root increases, instead of 
all paths having their lengths increase. 

Now which leaf should we choose? Keeping in mind that we want our data 
structure to be faithfully depletable, one possibility is suggested by considering 
a particular implementation of changemin(x), which replaces the minimum 
priority queue value with x. Let i be the leaf of the tournament tree that stores 
the minimum value. A natural implementation of changemin(x) first replaces 
the value in i with the value x, and then reestablishes appropriate values on the 
path leading to the root. Viewed in terms of depletion, this operation translates 
as follows. We remove the root of the shadow tree, and then proceed with a 
right-to-left incremental pairing pass, but first position a new node containing x 
to the right of the existing children of the root. 

This suggests that to implement insertion in the shadow data structure, 
we add a new singleton node tree to the forest. When executing a deletemin 
operation, we treat the inserted element, if it is not minimal, as a rightmost child 
of the root. As a concept, this also seems sensible in the context of the usual 
pairing heap data structure, particularly when combined with the auxiliary area 
method of Stasko and Vitter. As originally described, the auxiliary area 
method works as follows. 

A deletemin operation is always completed with a single tree remaining, 
referred to as the main tree. Between successive deletemin operations, the nodes 
and subtrees resulting from insertions and decrease-key operations are stored in 
what is referred to as the auxiliary area. When the next deletemin operation 
takes place, the trees in the auxiliary area are coalesced into a single tree using 
what is referred to as the multipass method (to be described shortly). This 
tree is then combined with the main tree using a linking operation. The root 
of the resulting tree is then removed, and its subtrees are then coalesced in the 
same manner as for the usual two-pass pairing heap. 

The multipass method for coalescing a list of trees begins by linking the 
trees in pairs. This is now repeated for the list of trees resulting from this first 
pass (of which there are half as many as initially present), and then repeated 
again, as necessary, until a single tree remains. (An alternative and preferable 
implementation places the result of each linking at the end of a queue consisting 
of the trees being linked, proceeding with the linkings in a round robin manner 
until a single tree remains.) 
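
The queue-based variant of the multipass method can be sketched in a few lines of Python. The sketch assumes the list-of-children node layout used for illustration only; `Node`, `link`, and `multipass` are hypothetical names.

```python
from collections import deque


class Node:
    def __init__(self, key):
        self.key = key
        self.children = []      # children[0] is the leftmost child


def link(a, b):
    """Link two non-empty trees; the loser becomes the leftmost child."""
    if b.key < a.key:
        a, b = b, a
    a.children.insert(0, b)
    return a


def multipass(trees):
    """Coalesce a list of trees round-robin: repeatedly link the two trees
    at the front of the queue and append the result at the back."""
    queue = deque(trees)
    while len(queue) > 1:
        first = queue.popleft()
        second = queue.popleft()
        queue.append(link(first, second))
    return queue[0] if queue else None
```

Coalescing k trees this way performs exactly k - 1 comparisons, one per linking.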

Now we modify the above by maintaining two auxiliary forests: one for the 
subtrees that result from decrease-key operations, and the other for newly 
inserted items. When executing a deletemin operation, we first apply the multipass 
method to both auxiliary areas, obtaining two trees, a decrease-key tree and an 
insertion tree. Second, we pair the decrease-key tree with the main tree. Call this 
the augmented main tree. Third, we compare the root of the insertion tree with 
that of the augmented main tree. If the root of the insertion tree wins the 
comparison, we link the two trees and proceed with the root removal and subsequent 
pairing passes in the usual way. On the other hand, if the root of the insertion 
tree loses the comparison, then we do not link the two trees together. Instead, 
we remove the main tree root and proceed with the pairing passes, treating the 
insertion tree as though it is the rightmost subtree of the root. In the sequel we 
refer to this modification as the insertion heuristic. 
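
The three steps of the modified deletemin can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: children are kept in a Python list, the threshold test on the first pass is omitted, and all function names are hypothetical.

```python
from collections import deque


class Node:
    def __init__(self, key):
        self.key = key
        self.children = []      # children[0] is the leftmost child


def link(a, b):
    """Pair two trees (either may be None); loser becomes leftmost child."""
    if a is None:
        return b
    if b is None:
        return a
    if b.key < a.key:
        a, b = b, a
    a.children.insert(0, b)
    return a


def multipass(forest):
    queue = deque(forest)
    while len(queue) > 1:
        queue.append(link(queue.popleft(), queue.popleft()))
    return queue[0] if queue else None


def two_pass(trees):
    """Usual two-pass combination: pair adjacent trees left to right,
    then perform a right-to-left incremental pairing pass."""
    paired = [link(trees[i], trees[i + 1] if i + 1 < len(trees) else None)
              for i in range(0, len(trees), 2)]
    result = None
    for t in reversed(paired):
        result = link(t, result)
    return result


def deletemin(main, decrease_key_forest, insertion_forest):
    """Return (minimum key, new main tree). The heap must be non-empty."""
    dk_tree = multipass(decrease_key_forest)
    ins_tree = multipass(insertion_forest)
    augmented = link(main, dk_tree)             # augmented main tree
    if augmented is None or (ins_tree is not None
                             and ins_tree.key < augmented.key):
        root = link(ins_tree, augmented)        # insertion root wins: link
        children = list(root.children)
    else:
        root = augmented                        # insertion root loses:
        children = list(root.children)          # no link is performed and
        if ins_tree is not None:                # the insertion tree acts as
            children.append(ins_tree)           # the rightmost subtree
    return root.key, two_pass(children)
```

When the insertion tree loses, no comparison beyond the root test is spent on it before the pairing passes begin, which is the point of the heuristic.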

Next, we turn to our second observation. Viewed in terms of a tournament 
tree, the first pass of the usual (two-pass) deletemin algorithm for pairing heaps 
restructures the tournament tree by performing a rotation on every second link 
along the path leading to the root, thereby “halving” this path. In the long run, 
this restructuring keeps the tree reasonably balanced. However, for a tree that 
happens to be highly balanced this path halving operation is somewhat harsh 
and may even worsen the overall balance of the tree. This motivates our second 
modification. Only when the number of children of the root (in the pairing heap) 
is relatively large should the path halving pass take place. Experimentally, a good 
dividing point seems to be when the number of children exceeds $1.2 \lg n$, where n 
is the number of items in the priority queue. (Note that the quantity $1.2 \lg n$ can 
be maintained with minimal overhead as priority queue operations take place.) 
In the sequel we refer to this modification as the first-pass heuristic. 
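
The threshold test can be illustrated with the short Python sketch below. It recomputes the logarithm on every call for clarity, whereas the text notes that the threshold can be maintained incrementally; the names and the default coefficient argument are assumptions of the sketch.

```python
import math


class Node:
    def __init__(self, key):
        self.key = key
        self.children = []      # children[0] is the leftmost child


def link(a, b):
    """Pair two trees (either may be None); loser becomes leftmost child."""
    if a is None:
        return b
    if b is None:
        return a
    if b.key < a.key:
        a, b = b, a
    a.children.insert(0, b)
    return a


def exceeds_threshold(num_children, n, coeff=1.2):
    """True when the root has more than coeff * lg n children,
    where n is the number of items in the priority queue."""
    return n > 0 and num_children > coeff * math.log2(n)


def combine_children(children, n, coeff=1.2):
    trees = list(children)
    if exceeds_threshold(len(trees), n, coeff):
        # Path-halving pass: pair 1st with 2nd, 3rd with 4th, and so on.
        trees = [link(trees[i], trees[i + 1] if i + 1 < len(trees) else None)
                 for i in range(0, len(trees), 2)]
    # The right-to-left incremental pairing pass is always performed.
    result = None
    for t in reversed(trees):
        result = link(t, result)
    return result
```

For example, with n = 16 the threshold is 1.2 × 4 = 4.8, so a root with five or more children triggers the path-halving pass while a root with four children does not.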

4 Experimental Findings 

Our experiments fall into two basic categories. With respect to the first category, 
the priority queue is utilized to sort data inserted into an initially empty priority 
queue, accomplished by repeated execution of the deletemin operation. With 
respect to the second category, after building a priority queue of a specified size, 
the remaining operations are subdivided into multiple rounds, where each round 
consists of a specified number of decrease-key operations, a single insertion, and 
a single deletemin operation. A round of operations leaves the priority queue 
size unchanged. All of our experimental results are reported in terms of average 
number of comparisons per deletemin operation. 

As noted by Jones, with respect to the second category of experiments, 
the initial “shape” of the priority queue (upon reaching its steady-state size) can 
influence subsequent performance. Most of our category 2 experiments (those 
reflected in Figures 3(a), 3(b), 5, 6, 7 and 8) build the initial priority queue 
configuration by repeated execution of the pattern: insert, insert, deletemin; 
until the steady-state size is reached. (We return to this issue below.) 

Our experiments involve both numerical and adversarial data. For simplicity, 
we implement a decrease-key operation so that no change in the affected data 
value takes place. Our numerical experiments (considered first) adhere to the 
negative exponential distribution model outlined in Jones: the next data 
value to be inserted is chosen to be the previously deleted value plus an offset 
$\delta = -\ln(\mathrm{rand})$, where rand is a random variable uniformly distributed over the 
interval (0,1). In the steady-state the next data item to be inserted has the same 
distribution as the items remaining in the priority queue. The items chosen 
for decrease-key operations are randomly selected from the priority queue. 
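
The generation of insertion values under this model can be illustrated by the following Python sketch. The function name and the optional generator argument are hypothetical; the only substantive step is adding the offset -ln(rand) to the previously deleted value.

```python
import math
import random


def next_insert_value(previous_deleted, rng=random):
    """Return the previously deleted value plus an offset -ln(u).

    rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1];
    this avoids taking the logarithm of zero and makes the offset
    nonnegative."""
    u = 1.0 - rng.random()
    return previous_deleted - math.log(u)
```

Because the offset is never negative, each inserted value is at least as large as the value most recently deleted.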

For the sorting experiments it suffices to utilize random numerical data values 
uniformly distributed over the interval (0,1) since this induces a uniform distri- 
bution on the corresponding order permutations. We note that for comparison- 
based sorting algorithms, the induced distribution on the order permutations is 
the only determinant of performance, when measured in comparisons. 

The choice of the coefficient 1.2, appearing in the threshold $1.2 \lg n$ for the 
first-pass heuristic, can be a matter of fine tuning, and in making this choice 
we use test runs that utilize numerical rather than adversarial data. Figure 3(a) 
typifies the considerations involved. Case A plots average deletemin cost versus 
coefficient choice for steady-state priority queue size 60000, with 0 decrease-key 
operations per round. Case B represents steady-state priority queue size 10000, 
with 2 decrease-key operations per round. 

We choose the “two-pass auxiliary area” variation of the pairing heap 
(described above) as our starting point. In the sequel this is referred to as the basic 
data structure. We refer to the data structure described in the preceding sec- 
tion, employing the first-pass and insertion heuristics, as the augmented data 
structure. Our experiments are primarily designed to compare the performance 
of the basic data structure with that of the augmented data structure. 

Our experiments involving adversarial data utilize an adversary defined by 
the rule that when a comparison takes place between two tree roots, the root 
of the larger tree wins the comparison. This outcome yields the least amount 
of information for a given comparison, in effect maximizing the number of 
remaining order permutations consistent with the data structure. (An exception, 
to ensure consistency of the adversary, is that the single tree root at the onset 
of a given round wins all comparisons in which it participates, and then gets 
deleted in the deletemin operation that concludes the round.) We also confine 
the decrease-key operations of a given round to the children of the root node 
at the onset of the round to ensure consistency of the adversary. Among the 
children of the root, the ones selected for decrease-key operations are chosen in 
accordance with the following criterion. When two nodes A and B are linked, 
with B becoming the child of A, we define the efficiency of the linking to be 
size(B)/size(A), where the size of a node is defined as the size of the tree rooted 
at the node (prior to the linking). With our given adversary this efficiency never 
exceeds 1. We select for decrease-key operations those children of the root whose 
linkings to the root have the highest efficiency values (defined at the moment 
of the linking) and perform these decrease-key operations in random order. The 
number of children chosen in a given round is a parameter of the experiment. 
(This selection criterion is designed to maximize loss of information.) 
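
The basic adversary rule and the efficiency of a linking can be sketched as below. The sketch covers only the "larger tree wins" rule; the exception for the root at the onset of a round and the selection of nodes for decrease-key are not modeled. The names `ANode` and `adversary_link` are hypothetical, and ties in size are resolved in favor of the first argument as an arbitrary choice.

```python
class ANode:
    """Node without a numeric key: the adversary decides each comparison."""

    def __init__(self, label):
        self.label = label
        self.size = 1           # number of nodes in the tree rooted here
        self.children = []      # children[0] is the leftmost child


def adversary_link(a, b):
    """Link two trees so that the root of the larger tree wins.
    Returns (winner, efficiency), where efficiency = size(B)/size(A)
    is measured prior to the linking, B being the new child of A."""
    if b.size > a.size:
        a, b = b, a
    efficiency = b.size / a.size
    a.children.insert(0, b)
    a.size += b.size
    return a, efficiency
```

Since the loser is never the larger tree, the returned efficiency cannot exceed 1, in agreement with the remark above.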

In contrasting experiments based on numerical data with those based on the 
use of our adversary, we note that the latter is intended to elicit poor behavior, 
whereas the former is intended to elicit typical behavior. With respect to exper- 
iments involving the adversary, the nodes selected for decrease-key operations 
reflect the data structure being tested, whereas for the numerical experiments, 
these choices are independent of the data structure. Figure 3(b) contrasts these 
two forms of testing. Both of the plots appearing in this figure reflect 8 decrease- 
key operations per round, and both reflect the augmented data structure. For 
this and all of the remaining figures (with two exceptions), our plots reflect runs 
involving steady-state priority queue sizes $n = 3^k$, for $6 \le k \le 12$ ($3^6 = 729$, 
$3^{12} = 531441$), and average deletemin cost is computed over $3n$ deletemin 
operations ($3n$ rounds per run). 

Figure 4 exhibits the results of our sorting runs. Data sets of size $3^k$ for 
$6 \le k \le 12$ are sorted, and average deletemin costs for the respective data set 
sizes are shown in the plots. With respect to numerical data, we find that the 
basic data structure requires 20% more comparisons than the augmented data 
structure. 

As described above, our category 2 experiments involve an initial data struc- 
ture configuration obtained by repeated executions of the pattern: insert, insert, 
deletemin; until the steady-state size is reached. Compared with an initial con- 
figuration obtained exclusively by repeated insertions, this seems to present a 
more realistic setting for our priority queue algorithms; the alternative configu- 
ration has extremely good initial balance, perhaps likely to result in unusually 
good subsequent performance. Figures 3(c) and (d) exhibit this effect. The plots 
labeled “balanced” concern runs that are based on an initial phase consisting 
exclusively of insertions. The plots labeled “unbalanced” involve runs based on 
the (insert, insert, deletemin) building phase, employed in our subsequent 
experiments. (None of the runs depicted in these two figures include decrease-key 
operations.) Considering the runs involving numerical data, using the basic data 
structure, the results are indistinguishable with respect to choice of initial config- 
uration. In contrast, the runs involving the augmented data structure are more 
sensitive to the initial configuration, particularly in the presence of adversarial 
data. 



numerical data    adversarial data 

Fig. 4. Sorting data sets 



The results of our category 2 experiments are shown in Figures 5 and 6, which 
involve, respectively, numerical and adversarial data. Our category 2 numerical 
experiments reveal that the basic data structure performs 27% more comparisons 
than the augmented data structure when no decrease-key operations occur. In 
the presence of decrease-key operations, the two data structures exhibit similar 
performance. This raises the following question. Might it be that after a pro- 
longed phase characterized by a high frequency of decrease-key operations, the 
augmented data structure permanently enters a state from which it no longer 
enjoys a performance advantage relative to the basic data structure, even if there 
are no subsequent decrease-key operations? Fortunately, Figure 7 suggests that 
this is not the case. The figure concerns a run involving numerical data and a 
steady-state priority queue size of 30000. After the initial building phase, 270000 
rounds are executed. The middle 90000 rounds contain 8 decrease-key operations 
per round; the first and last 90000 rounds contain no decrease-key operations. 
The 270000 rounds are subdivided into intervals of 1000 rounds each, and the 
average deletemin costs over the respective intervals are computed and plotted. 
As seen from the plot, after completion of the middle phase of the 
computation, the data structure quickly recovers its prior level of performance. This is 
similarly the case with respect to adversarial data (not shown). 

The category 2 experiments exhibit a larger performance gap between the 
augmented data structure and the basic data structure as compared with the 
category 1 (sorting) experiments (27% versus 20%), reflecting the fact that the 
insertion heuristic does not come into play in the category 1 experiments. 

A final issue we address concerns the skew-pairing mechanism. With respect 
to the preceding experiments the augmented data structure employs the strategy 
that upon removing the minimal root in the course of a deletemin operation, the 
children of that root are then combined using the two-pass method (provided 
that their number exceeds $1.2 \lg n$). An alternative would be to combine these 
children in the manner of the skew-pairing heap, as described above. 
Unfortunately, this rarely improves performance. Figure 8 contrasts the two strategies. 
Only when operating upon adversarial data and in the absence of decrease-key 
operations does the skew-pairing strategy improve upon the two-pass strategy, 
and then only slightly (Figure 8(b)). 

5 Concluding Remarks 

We have presented a priority queue transform, depletion, and have exhibited 
two applications. The first extends the capabilities of the skew-heap, converting 
it into a new variant of pairing heap. The second introduces a perspective on 
pairing heaps under which certain heuristic improvements are intuitively moti- 
vated. Experimental results concerning these heuristics have been presented and 
appear promising. 

Our transform effects a reordering of atomic operations, comparisons in this 
instance. Other interesting applications of this and other analogous transforms 
may exist. 

As is the case with the other variations of pairing heaps, an asymptotic 
determination of the amortized complexities of the skew-pairing heap is an open 
problem. 

Fig. 5. Constant heap size; various decrease-key frequencies (numerical data) 

Fig. 6. Constant heap size; various decrease-key frequencies (adversarial data) 

Fig. 7. Middle phase with 8 decrease-key operations per round 

Fig. 8. Skew-pairing versus two-pass: (a) numerical data, (b) adversarial data 




